Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction

Part of the book series: Studies in Computational Intelligence (SCI, volume 90)

In recent years, the spread of computers and the Internet has made a significant number of documents available in digital format. Collecting them in digital repositories raises problems that go beyond simple acquisition and creates the need to organize and classify them in order to improve the effectiveness and efficiency of retrieval. The success of such a process depends closely on the ability to understand the semantics of the document components and content. Since manually creating and maintaining an up-to-date index is infeasible given the huge amount of data involved, there is strong interest in methods that can acquire such knowledge automatically. This work presents a framework that makes intensive use of intelligent techniques to support the different tasks of automatic document processing, from acquisition to indexing and from categorization to storage and retrieval.

The prototype version of the DOMINUS system is presented. Its main characteristic is the use of a Machine Learning Server, a suite of different inductive learning methods and systems from which the most suitable one for each specific document processing phase is chosen and applied. The core system is the incremental first-order logic learner INTHELEX. Thanks to its incrementality, it can continuously update and refine the learned theories, dynamically extending its knowledge to handle even completely new classes of documents.
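The two ideas in this paragraph, a server that hands each processing phase its own learner and a learner that revises its theory only when a new example contradicts it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names `LearningServer` and `IncrementalLearner`, the phase names, and the lookup-table "theory" are hypothetical, and the revision step is a toy stand-in, not the first-order logic refinement that INTHELEX performs.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass
class IncrementalLearner:
    """Toy incremental learner (hypothetical, not INTHELEX).

    The "theory" is a plain lookup table from a feature description to a
    class label; it is revised only when an example is misclassified.
    """
    theory: Dict[Hashable, str] = field(default_factory=dict)
    revisions: int = 0

    def predict(self, features: Hashable) -> str:
        # Descriptions the theory does not cover are reported as unknown.
        return self.theory.get(features, "unknown")

    def observe(self, features: Hashable, label: str) -> None:
        # Incrementality: leave the theory alone if it already explains
        # the example, otherwise revise it on the spot.
        if self.predict(features) != label:
            self.theory[features] = label
            self.revisions += 1


class LearningServer:
    """Hypothetical registry that maps a processing phase to its learner."""

    def __init__(self) -> None:
        self._learners: Dict[str, IncrementalLearner] = {}

    def register(self, phase: str, learner: IncrementalLearner) -> None:
        self._learners[phase] = learner

    def learner_for(self, phase: str) -> IncrementalLearner:
        return self._learners[phase]


server = LearningServer()
server.register("layout_analysis", IncrementalLearner())
server.register("classification", IncrementalLearner())

clf = server.learner_for("classification")
clf.observe(("two_columns", "abstract_on_top"), "scientific_paper")
clf.observe(("two_columns", "abstract_on_top"), "scientific_paper")  # already covered
clf.observe(("single_column", "letterhead"), "letter")  # previously unseen class
```

In this sketch the second observation triggers no revision because the theory already covers it, while the third extends the theory to a class it had never seen, which is the behaviour the paragraph attributes to incrementality.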

Since DOMINUS is general and flexible, it can be embedded as a document management engine in many different Digital Library systems. Experiments in a real-world domain, scientific conference management, confirmed the good performance of the proposed prototype.

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N. (2008). Machine Learning for Digital Document Processing: from Layout Analysis to Metadata Extraction. In: Marinai, S., Fujisawa, H. (eds) Machine Learning in Document Analysis and Recognition. Studies in Computational Intelligence, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76280-5_5

  • DOI: https://doi.org/10.1007/978-3-540-76280-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76279-9

  • Online ISBN: 978-3-540-76280-5

  • eBook Packages: Engineering, Engineering (R0)
