Harvesting relational tables from lists on the web

Elmeleegy, Hazem; Madhavan, Jayant; Halevy, Alon

doi:10.1007/s00778-011-0223-0

Special Issue Paper
Published: 02 March 2011

Volume 20, pages 209–226, (2011)

The VLDB Journal

Abstract

A large number of web pages contain data structured in the form of “lists”. Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well-defined templates: they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table’s quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique to a large sample of about 100,000 lists crawled from the web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the web.
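
The two-phase idea in the abstract (split each line into fields, then compare the splits across lines to repair alignments) can be illustrated with a deliberately simplified Python sketch. This is not the authors' algorithm: it assumes a fixed delimiter set and a coarse "num"/"text" type signature in place of the multiple information sources and the HTML table corpus the paper uses, and it computes no extraction score. All names here (`split_line`, `field_type`, `align_row`, `list_to_table`) are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed delimiter set; real web lists are far less regular than this.
DELIMS = re.compile(r"\s*[,;|\t]\s*")


def split_line(line):
    """Phase 1 (simplified): split one list item into candidate fields."""
    return [f for f in DELIMS.split(line.strip()) if f]


def field_type(field):
    """Coarse type signature used to compare fields across lines."""
    return "num" if re.fullmatch(r"\d+(\.\d+)?", field) else "text"


def align_row(fields, col_types):
    """Phase 2 (simplified): fit one row to the table's column types."""
    k = len(col_types)
    if len(fields) >= k:
        # Too many fields: fold the surplus into the last column.
        return fields[:k - 1] + [" ".join(fields[k - 1:])]
    # Too few fields: pick the column subset with the most type matches,
    # leaving the skipped columns empty (None) for missing information.
    best = max(
        combinations(range(k), len(fields)),
        key=lambda cols: sum(
            field_type(f) == col_types[c] for f, c in zip(fields, cols)
        ),
    )
    row = [None] * k
    for f, c in zip(fields, best):
        row[c] = f
    return row


def list_to_table(lines):
    """Split every line, then align all rows to the majority column count."""
    rows = [r for r in (split_line(line) for line in lines) if r]
    if not rows:
        return []
    k = Counter(len(r) for r in rows).most_common(1)[0][0]
    full = [r for r in rows if len(r) == k]
    col_types = [
        Counter(field_type(r[c]) for r in full).most_common(1)[0][0]
        for c in range(k)
    ]
    return [align_row(r, col_types) for r in rows]


if __name__ == "__main__":
    demo = ["Paris, France, 2148000", "Cairo, Egypt, 9540000", "Lagos, 8048000"]
    for row in list_to_table(demo):
        print(row)
```

In the demo, the third line lacks a country; the alignment step places its number under the numeric column and leaves the middle column empty instead of shifting the value left.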

Author information

Authors and Affiliations

AT&T Labs - Research, Florham Park, NJ, USA
Hazem Elmeleegy
Google Inc., Mountain View, CA, USA
Jayant Madhavan & Alon Halevy

Corresponding author

Correspondence to Hazem Elmeleegy.

About this article

Cite this article

Elmeleegy, H., Madhavan, J. & Halevy, A. Harvesting relational tables from lists on the web. The VLDB Journal 20, 209–226 (2011). https://doi.org/10.1007/s00778-011-0223-0

Received: 11 August 2010
Revised: 20 January 2011
Accepted: 10 February 2011
Published: 02 March 2011
Issue Date: April 2011
DOI: https://doi.org/10.1007/s00778-011-0223-0
