ABSTRACT
The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge contained in these articles are the relations between human proteins and their phenotypes, which can remain hidden because of the exponential growth of publications. This creates opportunities for developing computational methods that extract such biomedical relations from the articles. However, no method currently exists for the automated extraction of relations involving human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database composed of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using a gold-standard dataset developed in-house, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction followed by classification constitutes a complete biomedical relation extraction pipeline for protein-phenotype relations.
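The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary lookup, the naive sentence splitter, and the function names (`split_sentences`, `find_terms`, `extract_comentions`, `baseline_is_valid`) are all assumptions made for illustration, and the negation-cue rule is only a toy stand-in for the supervised classifier, closer in spirit to a simple baseline than to PPPred itself.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence: str, terms: List[str]) -> List[str]:
    # Case-insensitive, whole-word dictionary lookup (hypothetical helper).
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return found


def extract_comentions(
    text: str, proteins: List[str], phenotypes: List[str]
) -> List[Tuple[str, str, str]]:
    # Step 1: every (protein, phenotype) pair that appears in the same sentence.
    comentions = []
    for sentence in split_sentences(text):
        for protein in find_terms(sentence, proteins):
            for phenotype in find_terms(sentence, phenotypes):
                comentions.append((protein, phenotype, sentence))
    return comentions


def baseline_is_valid(sentence: str) -> bool:
    # Step 2 (toy stand-in, NOT PPPred): reject sentences with a negation cue.
    return re.search(r"\b(no|not|without)\b", sentence, flags=re.IGNORECASE) is None
```

Each tuple returned by `extract_comentions` carries its source sentence, so a classifier of any kind can be applied to the candidates afterwards. In a real pipeline, the second step would be a trained model over features of the sentence rather than a keyword rule.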
Index Terms
- PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature