Mining knowledge from text using information extraction

Published: 01 June 2005

Abstract

An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
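The second approach described above (extract concrete data from documents, then analyze it with traditional data-mining techniques) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the systems the article reports on: the job postings are invented, the `SKILLS` gazetteer and its regular expression stand in for a learned extractor, and the function names `extract_skills` and `mine_rules` are hypothetical. The mining step is simple support and confidence counting over pairs of extracted items.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical job announcements, invented for illustration only.
POSTINGS = [
    "Seeking a developer with Java and SQL experience in Austin.",
    "Web engineer needed: Java, HTML, and SQL required.",
    "Data analyst position; must know SQL and Python.",
]

# Hypothetical gazetteer of skill names; a real IE system would typically
# learn its extractor rather than rely on a fixed word list.
SKILLS = ["Java", "SQL", "HTML", "Python"]
SKILL_RE = re.compile(r"\b(" + "|".join(map(re.escape, SKILLS)) + r")\b")


def extract_skills(text):
    """Step 1 (extraction): map free text to a structured record (a set of skills)."""
    return set(SKILL_RE.findall(text))


def mine_rules(records, min_support=2, min_confidence=0.6):
    """Step 2 (mining): find rules lhs -> rhs from co-occurrence counts."""
    item_counts = Counter(item for rec in records for item in rec)
    pair_counts = Counter(
        pair for rec in records for pair in combinations(sorted(rec), 2)
    )
    rules = []
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue  # pair occurs in too few records
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, n, confidence))
    return sorted(rules)


records = [extract_skills(p) for p in POSTINGS]
rules = mine_rules(records)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support}, confidence={confidence:.2f})")
```

Keeping extraction and mining as separate functions mirrors the division in the abstract: the extractor could be replaced by a learned model without touching the rule-mining step.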

Published in

ACM SIGKDD Explorations Newsletter, Volume 7, Issue 1
Natural language processing and text mining
June 2005
81 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1089815

Copyright © 2005 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
