Summary
To date, more than 16 million citations of published articles in the biomedical domain are available in the MEDLINE database. These articles describe the new discoveries that have accompanied the tremendous development of biomedicine during the last decade. It is crucial for biomedical researchers to retrieve and mine specific knowledge from this huge quantity of published articles efficiently. Researchers have therefore been developing text mining tools to find the knowledge, such as protein-protein interactions, that is most relevant and useful for specific analysis tasks. This chapter provides a road map to the various information extraction methods in the biomedical domain, such as protein name recognition and the discovery of protein-protein interactions. The disciplines involved in analyzing and processing unstructured text are summarized, current work in biomedical information extraction is categorized, and challenges in the field are presented along with possible solutions.
Keywords
- Support Vector Machine
- Hidden Markov Model
- Natural Language Processing
- Information Extraction
- Biomedical Literature
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Zhou, D., He, Y., Kwoh, C.K. (2008). From Biomedical Literature to Knowledge: Mining Protein-Protein Interactions. In: Smolinski, T.G., Milanova, M.G., Hassanien, AE. (eds) Computational Intelligence in Biomedicine and Bioinformatics. Studies in Computational Intelligence, vol 151. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70778-3_17
DOI: https://doi.org/10.1007/978-3-540-70778-3_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70776-9
Online ISBN: 978-3-540-70778-3
eBook Packages: Engineering, Engineering (R0)