Abstract
An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extract abstract knowledge from a text corpus, or to extract concrete data from a set of documents, which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
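The second approach described above (extract concrete data first, then mine it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' system: the toy job-announcement corpus `DOCS`, the hand-written `SKILLS` gazetteer standing in for a learned extractor, and the simple pairwise support/confidence rule miner standing in for a full association-rule algorithm.

```python
import re
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus standing in for job announcements.
DOCS = [
    "Seeking a developer with Java and SQL experience.",
    "Java programmer needed; SQL and Linux a plus.",
    "Python developer wanted, Linux required.",
    "Must know Java, SQL, and Python.",
]

# Hypothetical gazetteer of entity names; a real IE system would
# typically learn its extractor from annotated examples instead.
SKILLS = ["Java", "SQL", "Linux", "Python"]
SKILL_RE = re.compile(r"\b(" + "|".join(map(re.escape, SKILLS)) + r")\b")


def extract(doc):
    """Step 1 (IE): turn free text into a structured record (a set of skills)."""
    return set(SKILL_RE.findall(doc))


def mine_rules(records, min_support=0.5, min_confidence=0.7):
    """Step 2 (mining): find single-item rules A -> B over the extracted records."""
    n = len(records)
    single = Counter(s for r in records for s in r)
    # Ordered pairs (a, b) of distinct skills that co-occur in one record.
    pair = Counter(p for r in records for p in permutations(sorted(r), 2))
    rules = []
    for (a, b), count in pair.items():
        support = count / n              # fraction of records containing both
        confidence = count / single[a]   # fraction of records with a that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


records = [extract(d) for d in DOCS]
for a, b, s, c in mine_rules(records):
    print(f"{a} -> {b} (support={s:.2f}, confidence={c:.2f})")
```

On this toy corpus the only rules that clear both thresholds are Java -> SQL and SQL -> Java, each with support 0.75 and confidence 1.00. The point of the sketch is the division of labor: the extractor produces database-like records, and the miner never sees the raw text.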