Extracting Contextual Information from Scientific Literature Using CERMINE System

  • Conference paper
Semantic Web Evaluation Challenges (SemWebEval 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 548)

Abstract

CERMINE is a comprehensive open-source system for extracting structured metadata and references from born-digital scientific literature. Among other things, the system extracts information about the context in which an article was written, such as the authors and their affiliations, the relations between them, and references to other articles. The extracted information is presented in a structured, machine-readable form. CERMINE is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, makes it easy to improve or replace independent parts of the algorithm, and facilitates future expansion of the architecture. The implementation of the workflow relies mostly on supervised and unsupervised machine-learning techniques, which simplifies adapting the system to new document layouts and styles. In this paper we outline the overall workflow architecture, describe key aspects of the system implementation, provide details about the training and adjustment of individual algorithms, and report how CERMINE was used to extract contextual information from scientific articles in PDF format in the ESWC 2015 Semantic Publishing Challenge. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl.
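To make the idea of a loosely coupled, modular workflow concrete, the following is a minimal Python sketch. It is not CERMINE's code or API: the `Document` record, the step functions, and `run_workflow` are all hypothetical names, and the toy heuristics stand in for the machine-learning components the abstract mentions. It only illustrates how independent steps sharing one interface can be evaluated, swapped, or extended separately.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical document record; field names are illustrative assumptions.
@dataclass
class Document:
    raw_text: str
    metadata: Dict[str, object] = field(default_factory=dict)


# Every workflow step shares one signature, so steps stay loosely coupled.
Step = Callable[[Document], Document]


def _non_empty_lines(text: str) -> List[str]:
    return [line.strip() for line in text.splitlines() if line.strip()]


def extract_title(doc: Document) -> Document:
    # Toy heuristic (assumption): the first non-empty line is the title.
    lines = _non_empty_lines(doc.raw_text)
    doc.metadata["title"] = lines[0] if lines else ""
    return doc


def extract_authors(doc: Document) -> Document:
    # Toy heuristic (assumption): the second non-empty line lists authors,
    # separated by commas.
    lines = _non_empty_lines(doc.raw_text)
    authors = lines[1].split(",") if len(lines) > 1 else []
    doc.metadata["authors"] = [a.strip() for a in authors if a.strip()]
    return doc


def run_workflow(doc: Document, steps: List[Step]) -> Document:
    # Apply each step in order; any step can be replaced independently.
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    sample = Document("A Sample Title\nA. Author, B. Author\nBody text.")
    result = run_workflow(sample, [extract_title, extract_authors])
    print(result.metadata)
```

Because each step only depends on the shared `Document` type, a rule-based step such as `extract_authors` could be replaced by a trained classifier without touching the rest of the list passed to `run_workflow`.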

Author information

Corresponding author

Correspondence to Dominika Tkaczyk.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Tkaczyk, D., Bolikowski, Ɓ. (2015). Extracting Contextual Information from Scientific Literature Using CERMINE System. In: Gandon, F., Cabrio, E., Stankovic, M., Zimmermann, A. (eds) Semantic Web Evaluation Challenges. SemWebEval 2015. Communications in Computer and Information Science, vol 548. Springer, Cham. https://doi.org/10.1007/978-3-319-25518-7_8

  • DOI: https://doi.org/10.1007/978-3-319-25518-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25517-0

  • Online ISBN: 978-3-319-25518-7

  • eBook Packages: Computer Science, Computer Science (R0)
