Harvesting relational tables from lists on the web

  • Special Issue Paper
The VLDB Journal

Abstract

A large number of web pages contain data structured in the form of “lists”. Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists is challenging. The lists are manually generated and hence need not follow well-defined templates: they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields, and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table’s quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique to a large sample of about 100,000 lists crawled from the web. Our analysis of the extracted tables suggests that tens of millions of useful and queryable relational tables are likely extractable from lists on the web.
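
To make the split-then-align idea concrete, the following is a minimal Python sketch on toy data. It is not the authors' algorithm: the abstract describes splitting with multiple information sources and a corpus of HTML tables, whereas this sketch assumes a fixed delimiter set and a coarse numeric/text type signature in their place. All names (`DELIMS`, `split_line`, `field_type`, `align`, `extract_table`) are hypothetical, and no extraction score is computed.

```python
import re
from collections import Counter

# Assumption: a small fixed delimiter set; real web lists are far messier.
DELIMS = re.compile(r"\s*[,;|\t]\s*|\s+-\s+")

def split_line(line):
    """Split one list item into candidate fields on common delimiters."""
    return [f for f in DELIMS.split(line.strip()) if f]

def field_type(field):
    """Coarse type signature (assumption: stands in for richer evidence)."""
    return "num" if re.fullmatch(r"\d+(\.\d+)?", field) else "text"

def align(fields, template):
    """Place fields into len(template) columns, leaving gaps where needed.

    best[i][j] is the highest number of type matches when the first i
    fields occupy some of the first j columns with their order preserved.
    Requires len(fields) <= len(template).
    """
    n, m = len(fields), len(template)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            match = 1 if field_type(fields[i - 1]) == template[j - 1] else 0
            best[i][j] = max(best[i - 1][j - 1] + match, best[i][j - 1])
    # Trace back; on ties prefer leaving the later column empty.
    row, i, j = [""] * m, n, m
    while i > 0:
        if j > i and best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            row[j - 1] = fields[i - 1]
            i, j = i - 1, j - 1
    return row

def extract_table(lines):
    """Split every line, then align each split to the majority layout."""
    splits = [split_line(line) for line in lines if line.strip()]
    width = Counter(len(s) for s in splits).most_common(1)[0][0]
    full = [s for s in splits if len(s) == width]
    template = [
        Counter(field_type(r[c]) for r in full).most_common(1)[0][0]
        for c in range(width)
    ]
    table = []
    for s in splits:
        if len(s) > width:
            # Crude simplification: merge surplus fields into the last column.
            s = s[:width - 1] + [" ".join(s[width - 1:])]
        table.append(align(s, template))
    return table

toy_list = ["Widget, red, 12", "Gadget; blue; 7", "Gizmo, 3"]
for row in extract_table(toy_list):
    print(row)
```

In this toy run, the third item has only two fields, so the alignment step leaves the middle column empty instead of shifting the number into it. A fuller treatment would re-split over-long lines and score alternatives against outside evidence, where this sketch simply merges the surplus fields.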

Author information

Corresponding author

Correspondence to Hazem Elmeleegy.

About this article

Cite this article

Elmeleegy, H., Madhavan, J. & Halevy, A. Harvesting relational tables from lists on the web. The VLDB Journal 20, 209–226 (2011). https://doi.org/10.1007/s00778-011-0223-0
