Genre Classification in Automated Ingest and Appraisal Metadata

Kim, Yunhyong; Ross, Seamus

doi:10.1007/11863878_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4172)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at the documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multifaceted approach.
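
As a rough illustration of what classifying a document from more than one direction can look like, the sketch below trains one small classifier per view and adds their per-genre scores. It is not the authors' implementation: the abstract does not say which classifiers or features were used, so the naive Bayes model, the `NaiveBayesView` and `classify` names, the layout feature names, and the toy documents are all assumptions made for this example. Only two views are shown, loosely standing in for the visual-format and string-layout directions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesView:
    """Multinomial naive Bayes over a single feature view (hypothetical helper)."""

    def __init__(self, extract):
        self.extract = extract                    # maps a document to a list of features
        self.class_counts = Counter()             # number of training documents per genre
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            feats = self.extract(doc)
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, doc):
        """Return a log-probability score per genre, with add-one smoothing."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feat in self.extract(doc):
                if feat in self.vocab:            # ignore features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            scores[label] = score
        return scores


def text_view(doc):
    # Assumed text features: lower-cased whitespace tokens.
    return doc["text"].lower().split()


def layout_view(doc):
    # Assumed visual features: precomputed symbolic layout tags.
    return doc["layout"]


def classify(views, doc):
    """Sum the per-genre log scores of all views and return the best genre."""
    combined = defaultdict(float)
    for view in views:
        for label, score in view.log_scores(doc).items():
            combined[label] += score
    return max(combined, key=combined.get)


# Toy training data, invented for illustration only.
train_docs = [
    {"text": "Abstract introduction references theorem", "layout": ["two_column", "small_font"]},
    {"text": "Abstract method results references", "layout": ["two_column", "small_font"]},
    {"text": "Dear sir sincerely yours", "layout": ["one_column", "large_margin"]},
    {"text": "Dear madam regards sincerely", "layout": ["one_column", "large_margin"]},
]
train_labels = ["paper", "paper", "letter", "letter"]

views = [
    NaiveBayesView(text_view).fit(train_docs, train_labels),
    NaiveBayesView(layout_view).fit(train_docs, train_labels),
]

print(classify(views, {"text": "abstract results", "layout": ["two_column"]}))
```

Adding log scores treats the views as independent sources of evidence, which is a simplification. Further directions, such as stylometric or semantic features, would enter this sketch as additional `NaiveBayesView` instances with their own extractors.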

Author information

Authors and Affiliations

Digital Curation Centre (DCC) & Humanities Advanced Technology Information Institute (HATII), University of Glasgow, Glasgow, UK
Yunhyong Kim & Seamus Ross

Editor information

Editors and Affiliations

No Affiliations
Julio Gonzalo
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Moruzzi, 1, 56124, Pisa, Italy
Costantino Thanos
Dpto. Lenguajes y Sistemas Informáticos, UNED
M. Felisa Verdejo
Dep. de Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071, Alicante, Spain
Rafael C. Carrasco

About this paper

Cite this paper

Kim, Y., Ross, S. (2006). Genre Classification in Automated Ingest and Appraisal Metadata. In: Gonzalo, J., Thanos, C., Verdejo, M.F., Carrasco, R.C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2006. Lecture Notes in Computer Science, vol 4172. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11863878_6

DOI: https://doi.org/10.1007/11863878_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44636-1
Online ISBN: 978-3-540-44638-5
eBook Packages: Computer Science, Computer Science (R0)
