Genre Classification in Automated Ingest and Appraisal Metadata

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4172)

Abstract

Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments relating to the first two directions are described here; they are meant to indicate the promise of this multi-faceted approach.
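
As a rough illustration of what classifying from several directions can look like, the sketch below builds one small model per view and combines them. It is not the method of the paper: the two feature functions (`visual_view`, `string_view`) are hypothetical stand-ins for the first two directions, the nearest-centroid models and the summed-distance fusion are assumptions chosen for brevity, and the two training documents are invented toy data.

```python
import math
from collections import defaultdict


def visual_view(doc):
    """Hypothetical layout features: blank-line fraction, scaled mean line length."""
    lines = doc.split("\n")
    blank = sum(1 for ln in lines if not ln.strip())
    mean_len = sum(len(ln) for ln in lines) / len(lines)
    return [blank / len(lines), mean_len / 100.0]


def string_view(doc):
    """Hypothetical string features: share of numeric and of capitalised tokens."""
    tokens = doc.split()
    if not tokens:
        return [0.0, 0.0]
    digits = sum(1 for t in tokens if any(c.isdigit() for c in t))
    caps = sum(1 for t in tokens if t[0].isupper())
    return [digits / len(tokens), caps / len(tokens)]


# Further views (stylometric, semantic, contextual) would be appended here.
VIEWS = [visual_view, string_view]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(labelled_docs):
    """Return one {genre: centroid} model per view."""
    models = []
    for view in VIEWS:
        by_genre = defaultdict(list)
        for doc, genre in labelled_docs:
            by_genre[genre].append(view(doc))
        models.append({g: centroid(vs) for g, vs in by_genre.items()})
    return models


def classify(models, doc):
    """Pick the genre with the smallest distance summed over all views."""
    def total_distance(genre):
        return sum(
            math.dist(view(doc), model[genre])
            for view, model in zip(VIEWS, models)
        )
    return min(models[0], key=total_distance)


# Invented toy training data, one document per genre.
TRAINING = [
    ("Dear Sir,\n\nThank you for your letter.\n\nYours sincerely,\nJohn", "letter"),
    ("Item 1 20.00\nItem 2 35.50\nTotal 55.50", "invoice"),
]
MODELS = train(TRAINING)
```

Calling `classify(MODELS, text)` returns the genre label whose per-view centroids lie closest to the features of `text`. Summing distances keeps every view's evidence in play instead of letting one view decide alone, which is one simple way to realise the multi-faceted idea; weighting or voting schemes would be equally plausible choices.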

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, Y., Ross, S. (2006). Genre Classification in Automated Ingest and Appraisal Metadata. In: Gonzalo, J., Thanos, C., Verdejo, M.F., Carrasco, R.C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2006. Lecture Notes in Computer Science, vol 4172. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11863878_6

  • DOI: https://doi.org/10.1007/11863878_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44636-1

  • Online ISBN: 978-3-540-44638-5

  • eBook Packages: Computer Science, Computer Science (R0)