A Logical Framework for Template Creation and Information Extraction

Corney, David; Byrne, Emma; Buxton, Bernard; Jones, David

doi:10.1007/978-3-540-78488-3_5

Part of the book series: Studies in Computational Intelligence (SCI, volume 118)

Summary

Information extraction is the process of automatically identifying facts of interest in text, thereby transforming free text into a structured database. Past work has often been successful but ad hoc. In this paper we propose a more formal basis from which to discuss information extraction. We introduce a framework that will allow researchers to compare their methods as well as their results, and that will help to reveal new insights into information extraction and text mining practices.

One problem in many information extraction applications is the creation of templates, which are textual patterns used to identify information of interest. Our framework describes formally what a template is and covers other typical information extraction tasks. We show how common search algorithms can be used to create and optimise templates automatically, using sequences of overlapping templates, and we develop heuristics that make this search feasible. Finally, we demonstrate a successful implementation of the framework and apply it to a typical biological information extraction task.

Author information

Authors and Affiliations

Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
David Corney, Bernard Buxton & David Jones
School of Primary Care and Population Sciences, University College London, Highgate Hill, London, N19 5LW, UK
Emma Byrne

Editor information

Editors and Affiliations

Department of Computer Science, San Jose State University, San Jose, CA, 95192, USA
Tsau Young Lin
Department of Computer Science and Information Systems, Kennesaw State University, Building 11, Room 3060, 1000 Chastain Road, Kennesaw, GA, 30144, USA
Ying Xie
Department of Computer Science, The University at Stony Brook, Stony Brook, New York, 11794-4400, USA
Anita Wasilewska
Institute of Information Science, Academia Sinica, No 128, Academia Road, Section 2 Nankang, Taipei, 11529, Taiwan
Churn-Jung Liau

About this chapter

Cite this chapter

Corney, D., Byrne, E., Buxton, B., Jones, D. (2008). A Logical Framework for Template Creation and Information Extraction. In: Lin, T.Y., Xie, Y., Wasilewska, A., Liau, CJ. (eds) Data Mining: Foundations and Practice. Studies in Computational Intelligence, vol 118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78488-3_5

DOI: https://doi.org/10.1007/978-3-540-78488-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78487-6
Online ISBN: 978-3-540-78488-3
eBook Packages: Engineering, Engineering (R0)
