Skip to main content

Automatic Wrapper Generation for Multilingual Web Resources

  • Conference paper
  • First Online:
Discovery Science (DS 2002)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2534))

Included in the following conference series:

Abstract

We present a wrapper generation system to extract contents of semi-structured documents which contain instances of a record. The generation is done automatically using general assumptions on the structure of instances. It outputs a set of pairs of left and right delimiters surrounding instances of a field. In addition to input documents, our system also receives a set of symbols with which a delimiter must begin or end. Our system treats semi-structured documents just as strings so that it does not depend on markup and natural languages. It does not require any training examples which show where instances are. We show experimental results on both static and dynamic pages which are gathered from 13 Web sites, markuped in HTML or XML, and written in four natural languages. In addition to usual contents, generated wrappers extract useful information hidden in comments or tags which are ignored by other wrapper generation algorithms. Some generated delimiters contain whitespaces or multibyte characters.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. N. Ashish and C. Knoblock, Wrapper Generation for Semi-structured Internet Sources, Proc. of Workshop on Management of Semistructured Data, 1997.

    Google Scholar 

  2. D. Buttler, L. Liu and C. Pu, A Fully Automated Object Extraction System for the World Wide Web, International Conference on Distributed Computing Systems, 2001.

    Google Scholar 

  3. C.-H. Chang and S.-C. Lui, IEPAD: Information Extraction Based on Pattern Discovery, Proc. of the Tenth International Conference of World Wide Web (WWW2001), pp. 4–15, 2001.

    Google Scholar 

  4. D. W. Embley, Y. Jiang and Y.-K. Ng, Record-Boundary Discovery in Web Documents, Proc. of ACM SIGMOD Conference, pp. 467–478, 1999.

    Google Scholar 

  5. D. Ikeda, Y. Yamada and S. Hirokawa, Eliminating Useless Parts in Semistructured Documents using Alternation Counts, Proc. of the Fourth International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, Vol. 2226, pp. 113–127, 2001.

    Google Scholar 

  6. N. Kushmerick, D. S. Weld and R. B. Doorenbos, Wrapper Induction for Information Extraction, Intl. Joint Conference on Artificial Intelligence, pp. 729–737, 1997.

    Google Scholar 

  7. N. Kushmerick, Wrapper Induction: Efficiency and Expressiveness, Artificial Intelligence, Vol. 118, pp. 15–68, 2000.

    Article  MATH  MathSciNet  Google Scholar 

  8. K. Lerman, C. A. Knoblock and S Minton, Automatic Data Extraction from Lists and Tables in Web Sources, Adaptive Text Extraction and Mining workshop, 2001.

    Google Scholar 

  9. Y. Yamada, D. Ikeda and S. Hirokawa, SCOOP: A Record Extractor without Knowledge on Input, Proc. of the Fourth International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, Vol. 2226, pp. 428–487, 2001.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yamada, Y., Ikeda, D., Hirokawa, S. (2002). Automatic Wrapper Generation for Multilingual Web Resources. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_33

Download citation

  • DOI: https://doi.org/10.1007/3-540-36182-0_33

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00188-1

  • Online ISBN: 978-3-540-36182-4

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics