bioPDFX: preparing PDF scientific articles for biomedical text mining

Shitij Bhargava; Tsung-Ting Kuo; Ankit Goyal; Vincent Kuri; Gordon Lin; Chun-Nan Hsu

doi:10.7287/peerj.preprints.2993v1

Shitij Bhargava¹, Tsung-Ting Kuo², Ankit Goyal¹, Vincent Kuri¹, Gordon Lin¹, Chun-Nan Hsu²

1 Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California, United States

2 Health System Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California, United States

Published: 2017-05-26
Accepted: 2017-05-26

Subject Areas: Bioinformatics, Data Science, Databases, Digital Libraries, World Wide Web and Web Science
Keywords: bioPDFX, PDF conversion, PDF transcribing, biomedical text mining, natural language processing

Copyright: © 2017 Bhargava et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Bhargava S, Kuo T, Goyal A, Kuri V, Lin G, Hsu C. 2017. bioPDFX: preparing PDF scientific articles for biomedical text mining. PeerJ Preprints 5:e2993v1 https://doi.org/10.7287/peerj.preprints.2993v1

Abstract

Background. A huge amount of full-text biomedical literature is available in public repositories such as PubMed Central (PMC). However, a substantial number of these papers are available only in Portable Document Format (PDF), not in a plain-text format ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs: transcribing titles and abstracts correctly, segmenting references and acknowledgements, handling special characters, avoiding jumbling errors (text emitted in the wrong order), and preserving word boundaries.

Methods. In this paper, we present bioPDFX, a novel tool that combines the strengths of multiple state-of-the-art methods to offset their individual weaknesses and then applies machine learning methods to address all of the issues above.

Results. Experimental results on publications of genome-wide association studies (GWAS) demonstrated that bioPDFX significantly improved the quality of the XML compared with a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining.

Discussion. Overall, the pipeline developed in this paper makes literature published in the form of PDF files much better suited to text mining tasks, while also slightly improving the overall text quality. The service is freely accessible at http://textmining.ucsd.edu:9000 . A list of the PubMed Central IDs of the 941 articles used in this study (see Supplemental File 1) is available for download at the same URL. Instructions for running the service with a PubMed ID are given in Supplemental File 2.