Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction

Esposito, Floriana; Ferilli, Stefano; Basile, Teresa M. A.; Di Mauro, Nicola

doi:10.1007/978-3-540-76280-5_5

Part of the book series: Studies in Computational Intelligence ((SCI,volume 90))

In recent years, the spread of computers and the Internet has made a large number of documents available in digital format. Collecting them in digital repositories raises problems that go beyond simple acquisition: the documents must be organized and classified to make retrieval more effective and efficient. The success of this process depends closely on the ability to understand the semantics of the document components and content. Because manually creating and maintaining an up-to-date index is infeasible given the huge amount of data involved, there is strong interest in methods that acquire such knowledge automatically. This work presents a framework that makes intensive use of intelligent techniques to support the different tasks of automatic document processing, from acquisition to indexing and from categorization to storage and retrieval.

A prototype of the system, DOMINUS, is presented. Its main characteristic is the use of a Machine Learning Server, a suite of different inductive learning methods and systems from which the most suitable one is chosen and applied for each specific document processing phase. The core system is the incremental first-order logic learner INTHELEX. Thanks to incrementality, it can continuously update and refine the learned theories, dynamically extending its knowledge to handle even completely new classes of documents.
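
As a rough illustration of the idea described above (a server that hands out one learner per processing phase, and a learner that refines its theory one example at a time), here is a minimal Python sketch. Every name in it (`ToyLearningServer`, `ToyIncrementalLearner`, the phase name, the layout features) is hypothetical and chosen only for illustration. It is not the DOMINUS or INTHELEX implementation, and the set-intersection update is a deliberately simple substitute for first-order logic learning.

```python
class ToyIncrementalLearner:
    """Hypothetical stand-in for an incremental learner (NOT INTHELEX).

    Its "theory" maps each class label to the set of features shared by
    every example of that class seen so far.
    """

    def __init__(self):
        self.theory = {}  # label -> frozenset of features common to the class

    def refine(self, features, label):
        """Update the theory with one new example, without retraining."""
        observed = frozenset(features)
        if label not in self.theory:
            # A completely new class is added on the fly.
            self.theory[label] = observed
        else:
            # Generalize: keep only the features shared with the new example.
            self.theory[label] = self.theory[label] & observed

    def classify(self, features):
        """Return the labels whose common features all occur in `features`."""
        observed = frozenset(features)
        return sorted(
            label for label, common in self.theory.items() if common <= observed
        )


class ToyLearningServer:
    """Hypothetical registry that selects one learner per processing phase."""

    def __init__(self):
        self._learners = {}

    def register(self, phase, learner):
        self._learners[phase] = learner

    def learner_for(self, phase):
        return self._learners[phase]


server = ToyLearningServer()
# The phase name is illustrative, not taken from the system itself.
server.register("categorization", ToyIncrementalLearner())

learner = server.learner_for("categorization")
learner.refine({"two_columns", "abstract_block", "logo_top_left"}, "paper_style_a")
learner.refine({"two_columns", "abstract_block"}, "paper_style_a")
# Later, a previously unseen class arrives and is learned incrementally.
learner.refine({"single_column", "letterhead"}, "letter")

labels = learner.classify({"two_columns", "abstract_block", "page_number"})
```

In this toy version, the second `paper_style_a` example generalizes the stored description by dropping the feature the two examples do not share, and the `letter` class is added without touching what was already learned.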

Because DOMINUS is general and flexible, it can be embedded as a document management engine in many different Digital Library systems. Experiments in a real-world domain, scientific conference management, confirmed the good performance of the proposed prototype.


Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari, Via Orabona, 4, 70126, Bari, Italy
Floriana Esposito, Stefano Ferilli, Teresa M. A. Basile & Nicola Di Mauro

Editor information

Editors and Affiliations

Dipartimento di Sistemi e Informatica, University of Florence, Via S. Marta, 3, 50139, Firenze, Italy
Simone Marinai
Hitachi Central Research Laboratory, 1-280, Higashi-Koigakubo, Kokubunji-shi, Tokyo, 185-8601, Japan
Hiromichi Fujisawa

About this chapter

Cite this chapter

Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N. (2008). Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction. In: Marinai, S., Fujisawa, H. (eds) Machine Learning in Document Analysis and Recognition. Studies in Computational Intelligence, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76280-5_5

DOI: https://doi.org/10.1007/978-3-540-76280-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76279-9
Online ISBN: 978-3-540-76280-5
eBook Packages: Engineering, Engineering (R0)
