bioPDFX: preparing PDF scientific articles for biomedical text mining
- Published
- Accepted
- Subject Areas
- Bioinformatics, Data Science, Databases, Digital Libraries, World Wide Web and Web Science
- Keywords
- bioPDFX, PDF conversion, PDF transcribing, biomedical text mining, natural language processing
- Copyright
- © 2017 Bhargava et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2017. bioPDFX: preparing PDF scientific articles for biomedical text mining. PeerJ Preprints 5:e2993v1 https://doi.org/10.7287/peerj.preprints.2993v1
Abstract
Background. A huge amount of full-text biomedical literature is available in public repositories such as PubMed Central (PMC). However, a substantial number of the papers are in Portable Document Format (PDF) and do not provide a plain-text format ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs, such as correctly transcribing titles and abstracts, segmenting references and acknowledgements, handling special characters, avoiding jumbling errors (text in the wrong order), and preserving word boundaries.
Methods. In this paper, we present bioPDFX, a novel tool that combines the strengths of multiple state-of-the-art methods to offset their individual weaknesses and then applies machine learning methods to address all of the issues above.
Results. Experimental results on Genome-Wide Association Study (GWAS) publications demonstrated that bioPDFX significantly improved the quality of the output XML compared with a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining.
Discussion. Overall, the pipeline developed in this paper makes literature published as PDF files much better suited for text mining tasks, while also slightly improving the overall text quality. The service is freely accessible at http://textmining.ucsd.edu:9000 . A list of the PubMed Central IDs of the 941 articles used in this study (see Supplemental File 1) is available for download at the same URL. Instructions for running the service with a PubMed ID are described in Supplemental File 2.
Author Comment
This is a submission to PeerJ Computer Science for review.
Supplemental Information
Supplemental File 1
The raw data: a list of the PubMed Central IDs of the 941 articles.
Supplemental File 2
Instructions for running the bioPDFX service with a PubMed ID.