Converting heterogeneous statistical tables on the web to searchable databases

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Much of the world’s quantitative data reside in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
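The idea of indexing each data cell by its header paths can be illustrated with a small sketch. This is not the authors' implementation: the toy table, the function name `to_category_rows`, and the column names `country`, `year`, and `sex` are all hypothetical, and the header paths are assumed to have been extracted already. It only shows how a header-indexed table, once each data cell is addressed by a unique row-header path and column-header path, flattens into a category table that a relational store can query.

```python
import sqlite3

# Hypothetical toy table (not from the paper): one row-header category
# (country) and two column-header categories (year, sex).
row_paths = [("Norway",), ("Sweden",)]
col_paths = [("2010", "Male"), ("2010", "Female"),
             ("2011", "Male"), ("2011", "Female")]
data = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
]


def to_category_rows(row_paths, col_paths, data):
    """Flatten a header-indexed table into one tuple per data cell.

    Each output tuple is (row-path labels..., column-path labels..., value).
    Header paths must be unique so that every data cell is indexed by
    exactly one (row path, column path) pair.
    """
    if len(set(row_paths)) != len(row_paths):
        raise ValueError("row header paths must be unique")
    if len(set(col_paths)) != len(col_paths):
        raise ValueError("column header paths must be unique")
    rows = []
    for i, row_path in enumerate(row_paths):
        for j, col_path in enumerate(col_paths):
            rows.append((*row_path, *col_path, data[i][j]))
    return rows


rows = to_category_rows(row_paths, col_paths, data)

# Load the category table into an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (country TEXT, year TEXT, sex TEXT, value REAL)"
)
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", rows)

# Query a single fact by its category labels.
result = conn.execute(
    "SELECT value FROM facts WHERE country = ? AND year = ? AND sex = ?",
    ("Sweden", "2011", "Female"),
).fetchone()[0]
```

In this sketch every fact becomes one row whose key is the combination of category labels, so the same tuples could equally be emitted as subject-predicate-object triples for a triple store.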

Acknowledgments

Mukkai Krishnamoorthy acknowledges the help of Dr. Ravi Palla with Protégé. Prof. Andreas Dengel (DFKI) gave us excellent advice not only for improving the presentation but also for one of the algorithms.

Author information

Corresponding author

Correspondence to David W. Embley.

Cite this article

Embley, D.W., Krishnamoorthy, M.S., Nagy, G. et al. Converting heterogeneous statistical tables on the web to searchable databases. IJDAR 19, 119–138 (2016). https://doi.org/10.1007/s10032-016-0259-1

  • DOI: https://doi.org/10.1007/s10032-016-0259-1
