Automatic Wrapper Generation for Multilingual Web Resources

Yamada, Yasuhiro; Ikeda, Daisuke; Hirokawa, Sachio

doi:10.1007/3-540-36182-0_33

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2534)

Included in the following conference series:

International Conference on Discovery Science

Abstract

We present a wrapper generation system that extracts the contents of semi-structured documents containing instances of a record. Wrappers are generated automatically from general assumptions about the structure of instances. The output is a set of pairs of left and right delimiters, each pair surrounding the instances of a field. In addition to the input documents, the system receives a set of symbols with which a delimiter must begin or end. The system treats semi-structured documents simply as strings, so it depends on neither the markup language nor the natural language. It requires no training examples that show where instances are. We report experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML, and written in four natural languages. In addition to the usual contents, the generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
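To make the notion of a wrapper as a set of left and right delimiter pairs concrete, the following is a minimal sketch of how such a wrapper might be applied to a document treated purely as a string. It is not the authors' generation algorithm, which the abstract does not detail; the function name `apply_wrapper`, the sample document, and the sample delimiters are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: a wrapper is an ordered list of (left, right) delimiter pairs,
# one pair per field of the record.
Wrapper = List[Tuple[str, str]]


def apply_wrapper(document: str, wrapper: Wrapper) -> List[List[str]]:
    """Extract records by scanning the raw string for each field's delimiters.

    The document is never parsed as HTML or XML; only substring search is used,
    so delimiters may contain whitespace or multibyte characters.
    """
    if any(not left or not right for left, right in wrapper):
        raise ValueError("delimiters must be non-empty")

    records: List[List[str]] = []
    pos = 0
    while True:
        record: List[str] = []
        for left, right in wrapper:
            start = document.find(left, pos)
            if start == -1:
                return records  # no further complete record
            start += len(left)
            end = document.find(right, start)
            if end == -1:
                return records
            record.append(document[start:end])
            # Resume at the right delimiter so it may overlap the next left one.
            pos = end
        records.append(record)


# Hypothetical input: one field in visible text, one hidden in a comment.
doc = (
    "<li><b>東京</b> <!-- id=13 --></li>\n"
    "<li><b>Paris</b> <!-- id=75 --></li>\n"
)
wrapper = [("<b>", "</b>"), ("<!-- id=", " -->")]
print(apply_wrapper(doc, wrapper))
```

Because the scan relies only on substring search, the same code handles the multibyte field value and the value hidden inside a comment without any knowledge of the markup language.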

Author information

Authors and Affiliations

Graduate School of Information Science and Electrical Engineering, Kyushu University, 812-8581, Fukuoka, Japan
Yasuhiro Yamada
Computing and Communications Center, Kyushu University, 812-8581, Fukuoka, Japan
Daisuke Ikeda & Sachio Hirokawa

Editor information

Editors and Affiliations

Deutsches Forschungszentrum für Künstliche Intelligenz, Stuhlsatzenhausweg 3, 66123, Saarbrücken, Germany
Steffen Lange
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, 101-8430, Tokyo, Japan
Ken Satoh
Department of Computer Science, University of Maryland, College Park, 20742, MD, USA
Carl H. Smith

About this paper

Cite this paper

Yamada, Y., Ikeda, D., Hirokawa, S. (2002). Automatic Wrapper Generation for Multilingual Web Resources. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_33

DOI: https://doi.org/10.1007/3-540-36182-0_33
Published: 08 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00188-1
Online ISBN: 978-3-540-36182-4
eBook Packages: Springer Book Archive
