Skip to main content

Boilerplate Detection and Recoding

  • Conference paper
Advances in Information Retrieval (ECIR 2014)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8416))

Included in the following conference series:

Abstract

Many information access applications have to tackle natural language texts that contain a large proportion of repeated and mostly invariable patterns – called boilerplates –, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning and an ideal document representation should reflect this phenomenon.

We propose here a method that detects automatically and in an unsupervised way such motifs; and enriches the document representation by including specific features for these motifs. We experimentally show that this document recoding strategy leads to improved classification on different collections.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Baroni, M., Chantree, F., Kilgarriff, A., Sharoff, S.: CleanEval: a competition for cleaning webpages. In: LREC (2008)

    Google Scholar 

  2. Bernstein, Y., Zobel, J.: Accurate discovery of co-derivative documents via duplicate text detection. Inf. Syst. 31(7), 595–609 (2006)

    Article  Google Scholar 

  3. Iliopoulos, C.S., McHugh, J., Peterlongo, P., Pisanti, N., Rytter, W., Sagot, M.: A first approach to finding common motifs with gaps. International Journal of Foundation of Computer Science 16(6), 1145–1155 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  4. Gallé, M.: Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem. Université de Rennes 1 (February 2011)

    Google Scholar 

  5. Gallé, M.: The bag-of-repeats representation of documents. In: SIGIR (2013)

    Google Scholar 

  6. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (January 1997)

    Google Scholar 

  7. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: WSDM, p. 441. ACM Press, New York (2010)

    Google Scholar 

  8. Kohlschütter, C., Nejdl, W.: A Densitometric Approach to Web Page Segmentation Segmentation as a Visual Problem. In: CIKM, pp. 1173–1182 (2008)

    Google Scholar 

  9. Manning, C., Raghavan, P., Schütze, H.: Introduction to Inf Retrieval. Cambridge UP (2009)

    Google Scholar 

  10. Marsan, L., Sagot, M.-F.: Extracting structured motifs using a suffix tree–algorithms and application to promoter consensus identification. Journal of Computational Biology 7(3/4), 345–362 (2000)

    Article  Google Scholar 

  11. Pasternack, J., Roth, D.: Extracting Article Text from the Web with Maximum Subsequence Segmentation. In: WWW, pp. 971–980 (2009)

    Google Scholar 

  12. Pisanti, N., Carvalho, A.M., Marsan, L., Sagot, M.-F.: RISOTTO: Fast extraction of motifs with mismatches. In: Correa, J.R., Hevia, A., Kiwi, M. (eds.) LATIN 2006. LNCS, vol. 3887, pp. 757–768. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  13. Zhang, Y., Zaki, M.: Exmotif: efficient structured motif extraction. Algorithms for Molecular Biology 1(1), 21 (2006)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Gallé, M., Renders, JM. (2014). Boilerplate Detection and Recoding. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-06028-6_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06027-9

  • Online ISBN: 978-3-319-06028-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics