A Logical Framework for Template Creation and Information Extraction

  • Chapter
Data Mining: Foundations and Practice

Part of the book series: Studies in Computational Intelligence (SCI, volume 118)

Summary

Information extraction is the process of automatically identifying facts of interest from pieces of text, and so transforming free text into a structured database. Past work has often been successful but ad hoc, and in this paper we propose a more formal basis from which to discuss information extraction. We introduce a framework which will allow researchers to compare their methods as well as their results, and will help to reveal new insights into information extraction and text mining practices.

One problem in many information extraction applications is the creation of templates, which are textual patterns used to identify information of interest. Our framework formally describes what a template is and also covers other typical information extraction tasks. We show how common search algorithms can be used to create and optimise templates automatically, using sequences of overlapping templates, and we develop heuristics that make this search feasible. Finally, we demonstrate a successful implementation of the framework and apply it to a typical biological information extraction task.
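
To make the idea of a template and of searching over templates concrete, the following is a minimal sketch and not the authors' formalism or implementation. It assumes a template is a fixed-length sequence of elements (a literal word, a category such as PROTEIN, or a wildcard), that candidate templates are produced by relaxing one literal word to a wildcard, and that a simple greedy hill-climb with a "true hits minus false hits" score stands in for the search algorithms and heuristics mentioned above. The names `PROTEINS`, `find_matches`, `generalisations`, `score` and `hill_climb` are all hypothetical, as are the example sentences.

```python
from typing import List, Optional, Tuple

# Hypothetical gazetteer; a real system would use a named-entity tagger.
PROTEINS = {"RAD51", "BRCA2", "p53", "MDM2"}

# Assumed representation: ("word", w), ("cat", name) or ("any", None).
Element = Tuple[str, Optional[str]]
Template = List[Element]


def element_matches(elem: Element, token: str) -> bool:
    """Check one template element against one token."""
    kind, value = elem
    if kind == "word":
        return token.lower() == value
    if kind == "cat":
        return value == "PROTEIN" and token in PROTEINS
    return kind == "any"


def find_matches(template: Template, tokens: List[str]) -> List[List[str]]:
    """Return every contiguous token window that the template matches."""
    n = len(template)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if all(element_matches(e, t) for e, t in zip(template, tokens[i:i + n]))
    ]


def generalisations(template: Template) -> List[Template]:
    """Successor templates: relax exactly one literal word to a wildcard."""
    return [
        template[:i] + [("any", None)] + template[i + 1:]
        for i, (kind, _) in enumerate(template)
        if kind == "word"
    ]


def score(template: Template, positives: List[str], negatives: List[str]) -> int:
    """Assumed heuristic: sentences correctly matched minus sentences wrongly matched."""
    hits = sum(1 for s in positives if find_matches(template, s.split()))
    false_hits = sum(1 for s in negatives if find_matches(template, s.split()))
    return hits - false_hits


def hill_climb(seed: Template, positives: List[str], negatives: List[str]):
    """Greedy search: move to the best successor while the score strictly improves."""
    current, best = seed, score(seed, positives, negatives)
    while True:
        candidates = [(score(t, positives, negatives), t) for t in generalisations(current)]
        if not candidates:
            return current, best
        top_score, top = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return current, best
        current, best = top, top_score


# Illustrative data only; these sentences are invented for the sketch.
seed = [("cat", "PROTEIN"), ("word", "interacts"), ("word", "with"), ("cat", "PROTEIN")]
positives = ["RAD51 interacts with BRCA2 in vivo", "p53 associates with MDM2"]
negatives = ["p53 but not MDM2 was detected"]

best_template, best_score = hill_climb(seed, positives, negatives)
print(best_template, best_score)
```

In this toy run the search relaxes the verb to a wildcard, which lets the template cover both positive sentences, and then stops because relaxing "with" as well would also match the negative sentence. A greedy climb is only one of many possible strategies; it is used here because it is short, not because the chapter prescribes it.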

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Corney, D., Byrne, E., Buxton, B., Jones, D. (2008). A Logical Framework for Template Creation and Information Extraction. In: Lin, T.Y., Xie, Y., Wasilewska, A., Liau, CJ. (eds) Data Mining: Foundations and Practice. Studies in Computational Intelligence, vol 118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78488-3_5

  • DOI: https://doi.org/10.1007/978-3-540-78488-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78487-6

  • Online ISBN: 978-3-540-78488-3

  • eBook Packages: Engineering, Engineering (R0)
