Summary
Information extraction is the process of automatically identifying facts of interest from pieces of text, and so transforming free text into a structured database. Past work has often been successful but ad hoc, and in this paper we propose a more formal basis from which to discuss information extraction. We introduce a framework which will allow researchers to compare their methods as well as their results, and will help to reveal new insights into information extraction and text mining practices.
One problem in many information extraction applications is the creation of templates, which are textual patterns used to identify information of interest. Our framework describes formally what a template is and covers other typical information extraction tasks. We show how common search algorithms can be used to create and optimise templates automatically, using sequences of overlapping templates, and we develop heuristics that make this search feasible. Finally we demonstrate a successful implementation of the framework and apply it to a typical biological information extraction task.
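To make the idea of searching for templates concrete, here is a minimal sketch. It is not the chapter's formal framework. It assumes a deliberately simple representation in which a template is a tuple of tokens, a hypothetical `*` wildcard matches any single token, and candidates are scored by F-measure on a few labelled sentences. The function names, the wildcard symbol, and the toy sentences are all illustrative assumptions.

```python
from collections import deque
from typing import Iterator, List, Sequence, Tuple

WILDCARD = "*"  # assumed marker: matches any single token
Template = Tuple[str, ...]


def matches(template: Template, tokens: Sequence[str]) -> bool:
    """Return True if the template matches some contiguous window of tokens."""
    n = len(template)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(t == WILDCARD or t == w for t, w in zip(template, window)):
            return True
    return False


def f_score(template: Template,
            positives: List[List[str]],
            negatives: List[List[str]]) -> float:
    """F-measure of a template on labelled sentences (toy scoring function)."""
    tp = sum(matches(template, s) for s in positives)
    fp = sum(matches(template, s) for s in negatives)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(positives)
    return 2 * precision * recall / (precision + recall)


def generalise(template: Template) -> Iterator[Template]:
    """Yield successors: replace one literal token with a wildcard."""
    for i, tok in enumerate(template):
        if tok != WILDCARD:
            yield template[:i] + (WILDCARD,) + template[i + 1:]


def search_templates(seed: Template,
                     positives: List[List[str]],
                     negatives: List[List[str]]) -> Tuple[Template, float]:
    """Breadth-first search over generalisations of the seed template.

    More specific templates are visited first, and a candidate replaces the
    current best only if it scores strictly higher, so ties favour the more
    specific template.
    """
    best, best_score = seed, f_score(seed, positives, negatives)
    queue = deque([seed])
    seen = {seed}
    while queue:
        current = queue.popleft()
        for candidate in generalise(current):
            if candidate in seen:
                continue
            seen.add(candidate)
            queue.append(candidate)
            score = f_score(candidate, positives, negatives)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Toy data (invented for illustration only).
positives = [
    "ProtA interacts with ProtB".split(),
    "ProtC interacts with ProtD".split(),
]
negatives = [
    "ProtA is expressed in liver".split(),
    "ProtB was purified with ProtC".split(),
]
seed = ("ProtA", "interacts", "with", "ProtB")

best, score = search_templates(seed, positives, negatives)
print(" ".join(best), score)
```

On this toy data the search generalises the seed to `* interacts with *`. An exhaustive breadth-first search is only practical for very short templates, since the number of generalisations doubles with each token; that is the kind of cost that motivates heuristics for keeping the search feasible.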
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Corney, D., Byrne, E., Buxton, B., Jones, D. (2008). A Logical Framework for Template Creation and Information Extraction. In: Lin, T.Y., Xie, Y., Wasilewska, A., Liau, CJ. (eds) Data Mining: Foundations and Practice. Studies in Computational Intelligence, vol 118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78488-3_5
DOI: https://doi.org/10.1007/978-3-540-78488-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78487-6
Online ISBN: 978-3-540-78488-3
eBook Packages: Engineering, Engineering (R0)