research-article

PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature

Authors:
Morteza Pourreza Shahri (Montana State University, Bozeman, MT, USA)
Gillian Reynolds (Montana State University, Bozeman, MT, USA)
Mandi Marie Roe (Montana State University, Bozeman, MT, USA)
Indika Kahanda (Montana State University, Bozeman, MT, USA)

BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, September 2019, Pages 414–422. https://doi.org/10.1145/3307339.3342167

Published: 04 September 2019

ABSTRACT

The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge contained in these articles are the relations between human proteins and their phenotypes, which can stay hidden due to the exponential growth of publications. This presents a range of opportunities for developing computational methods that extract these biomedical relations from the articles. However, no such method currently exists for the automated extraction of relations involving human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database composed of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using a gold standard dataset developed in-house, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction and classification constitutes a complete biomedical relation extraction pipeline for extracting protein-phenotype relations.
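
To make the two-step idea concrete, the following is a minimal sketch of co-mention extraction followed by sentence-level classification. It is not the PPPred implementation: the abstract does not specify the features or the classifier, so the toy lexicons (`PROTEINS`, `PHENOTYPES`), the placeholder protein names `ProtA` and `ProtB`, the invented training sentences, and the choice of a bag-of-words Naive Bayes model are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons; the dictionaries used by the real system are not given here.
PROTEINS = {"ProtA", "ProtB"}
PHENOTYPES = {"seizures", "short stature"}


def _mentioned(terms, lowered):
    """Return the terms that occur as whole words in the lowercased sentence."""
    return sorted(t for t in terms
                  if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered))


def extract_comentions(sentence):
    """Step 1: every (protein, phenotype) pair mentioned in the same sentence."""
    lowered = sentence.lower()
    return list(product(_mentioned(PROTEINS, lowered),
                        _mentioned(PHENOTYPES, lowered)))


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


class NaiveBayes:
    """Step 2 (assumed model): multinomial Naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {label: Counter() for label in self.class_counts}
        for sentence, label in zip(sentences, labels):
            self.token_counts[label].update(tokenize(sentence))
        self.vocab = {tok for c in self.token_counts.values() for tok in c}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(sentence):
                if tok in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training sentences: True = the co-mention expresses a valid relation.
TRAIN = [
    ("Mutations in ProtA cause seizures.", True),
    ("Loss of ProtB leads to short stature.", True),
    ("ProtA was not associated with short stature.", False),
    ("No link between ProtB and seizures was found.", False),
]
model = NaiveBayes().fit([s for s, _ in TRAIN], [y for _, y in TRAIN])


def classify_sentence(model, sentence):
    """Run both steps: extract co-mentions, then label each one."""
    label = model.predict(sentence)
    return [(pair, label) for pair in extract_comentions(sentence)]
```

In this simplified sketch the label is predicted once per sentence and attached to every extracted pair; a realistic classifier would presumably use features that depend on where each protein and phenotype mention sits in the sentence. Calling `classify_sentence(model, "ProtA was not associated with seizures.")` yields the single pair `("ProtA", "seizures")` labeled as not valid.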

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Bioinformatics

Published in
BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
September 2019
716 pages
ISBN: 9781450366663
DOI: 10.1145/3307339
General Chairs: Xinghua (Mindy) Shi (Temple University, USA), Michael Buck (University of Buffalo, USA)
Program Chairs: Jian Ma (Carnegie Mellon University, USA), Pierangelo Veltri (University Magna Graecia of Catanzaro, Italy)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
biomedical natural language processing
pppred
protein-phenotype prediction
Acceptance Rates
BCB '19 Paper Acceptance Rate: 42 of 157 submissions, 27%
Overall Acceptance Rate: 254 of 885 submissions, 29%
Article Metrics
- Total Citations: 3
- Total Downloads: 103
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 0