Abstract
We present a wrapper generation system that extracts the contents of semi-structured documents containing instances of a record. Wrappers are generated automatically under general assumptions about the structure of instances. The system outputs a set of pairs of left and right delimiters that surround instances of a field. In addition to the input documents, the system receives a set of symbols with which a delimiter must begin or end. It treats semi-structured documents simply as strings, so it does not depend on the markup language or the natural language. It requires no training examples showing where instances are. We show experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML, and written in four natural languages. In addition to the usual contents, the generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
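To make the output format concrete, the following is a minimal sketch of how a wrapper of this form (one left/right delimiter pair per field) could be applied to a document treated as a plain string. It illustrates only the extraction step, not the generation algorithm, which the abstract does not describe. The function name `apply_wrapper`, the sample document, and the delimiter values are all assumptions made for illustration.

```python
from typing import List, Tuple


def apply_wrapper(document: str,
                  delimiters: List[Tuple[str, str]]) -> List[List[str]]:
    """Extract records from a string using one (left, right) pair per field.

    Hypothetical helper: fields are assumed to appear in a fixed order,
    and every left delimiter is assumed to be non-empty.
    """
    if any(left == "" for left, _ in delimiters):
        raise ValueError("left delimiters must be non-empty")

    records: List[List[str]] = []
    pos = 0
    while True:
        record: List[str] = []
        for left, right in delimiters:
            start = document.find(left, pos)
            if start == -1:
                return records          # no further record in the document
            start += len(left)
            end = document.find(right, start)
            if end == -1:
                return records          # unterminated field: stop scanning
            record.append(document[start:end])
            pos = end                   # continue from the end of this field
        records.append(record)


# Illustrative input: one field sits in ordinary markup, the other is
# hidden in a comment; one delimiter contains whitespace, and one field
# value uses multibyte characters.
sample = (
    "<li><b>東京</b> <!-- id=13 --></li>\n"
    "<li><b>Paris</b> <!-- id=75 --></li>\n"
)
wrapper = [("<b>", "</b>"), ("<!-- id=", " -->")]
print(apply_wrapper(sample, wrapper))
```

Because the document is scanned as a string rather than parsed as HTML or XML, the same routine works regardless of markup or natural language, which is the property the abstract emphasizes.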
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yamada, Y., Ikeda, D., Hirokawa, S. (2002). Automatic Wrapper Generation for Multilingual Web Resources. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_33
DOI: https://doi.org/10.1007/3-540-36182-0_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00188-1
Online ISBN: 978-3-540-36182-4
eBook Packages: Springer Book Archive