Elsevier

Computers in Biology and Medicine

Volume 40, Issues 11–12, November–December 2010, Pages 900-911

Design and evaluation of an ontology based information extraction system for radiological reports

https://doi.org/10.1016/j.compbiomed.2010.10.002

Abstract

This paper describes an information extraction system that extracts the information available in free-text Turkish radiology reports and converts it into a structured information model, using manually created extraction rules and a domain ontology. The ontology provides flexibility in the design of extraction rules and determines the information model for the extracted semantic information. Although our information extraction system concentrates mainly on abdominal radiology reports, it can be used in another field of medicine by adapting its ontology and extraction rule set. We achieved very high precision and recall when evaluating the developed system on unseen radiology reports.

Introduction

Health information systems and electronic health records are expected to lower costs and improve health care quality through improved access to information [1]. Free unstructured text is still the most common information source in medical records. Many medical disciplines, such as radiology, pathology, and nuclear medicine, rely almost completely on unstructured free text to disseminate information. This format is widely used for both storage and exchange of information about an individual patient, and the file of an individual patient usually contains several different free-text reports such as clinical notes, patient history, or discharge summaries. The information in these reports is a valuable resource for management, research, and educational purposes, and medical applications such as clinical decision support systems need to use it. Nevertheless, free text is not as useful as structured and coded data for decision making or for knowledge discovery related to public health. Although the information required to answer many medical questions is stored electronically, we cannot precisely answer questions such as “What is the rate of non-pathological renal cysts in patients without renal complaints?”, “What are the average sizes of left and right kidneys in our population?”, and “How does renal parenchymal echogenic structure change over time before a renal cancer is diagnosed?”, since the required information is not computationally accessible.

As more and more text becomes available electronically, there is a growing need for systems that extract information automatically from narrative data. Manual extraction of this information is a costly and time-consuming process. As the volume of text grows, machine processing becomes necessary to make use of it. Information extraction (IE) and natural language processing (NLP) techniques are required to extract the useful information from these free texts.

Information extraction, a sub-discipline of NLP, focuses on identifying specific facts and relations within unstructured texts, extracting the relevant values, and transforming them into standardized codes and/or structured information. An information extraction task takes two inputs, a free-text document that is the source of information and a set of predefined templates, and fills these templates with suitable information extracted from the document. The filled templates are the structured representation of the information available in the document.
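The template-filling view of an IE task can be illustrated with a minimal sketch. It is not the system described in this paper: the `FindingTemplate` slots, the pattern, and the English example sentences are all assumptions chosen for readability (the reports handled by the paper are Turkish).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FindingTemplate:
    """Hypothetical template: one slot per fact we want to capture."""
    organ: Optional[str] = None
    attribute: Optional[str] = None
    value: Optional[str] = None


# Hypothetical rule for sentences such as "The liver size is normal."
PATTERN = re.compile(
    r"the (?P<organ>\w+) (?P<attribute>size|contour|echogenicity) is (?P<value>\w+)",
    re.IGNORECASE,
)


def fill_templates(document: str) -> List[FindingTemplate]:
    """Return one filled template per rule match found in the free text."""
    return [
        FindingTemplate(m["organ"].lower(), m["attribute"].lower(), m["value"].lower())
        for m in PATTERN.finditer(document)
    ]


report = "The liver size is normal. The spleen contour is regular."
for template in fill_templates(report):
    print(template)
```

The two inputs named in the text map directly onto the sketch: `report` is the free-text document, and `FindingTemplate` is the predefined template whose slots are filled from each match.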

IE has been a popular research topic since the late 1980s, promoted by the Message Understanding Conferences (MUCs) sponsored by the Defense Advanced Research Projects Agency (DARPA). The MUCs have had a great impact on information extraction research: many new IE problems were identified, and algorithms were developed to solve them. The MUCs also helped develop the evaluation metrics used to compare the information extraction systems that participated in the competitions.

A typical information extraction system may have two main subtasks: entity recognition and relation extraction. Entity recognition tries to identify the boundaries of the text segments that represent entities in natural language texts. For example, protein name extraction is an entity recognition task that tries to identify the text segments representing protein names in medical texts. Relation extraction tries to identify the relations between entities in order to fill predefined templates. For example, extracting interaction relations among proteins is a relation extraction task. Both tasks use pattern matching techniques to extract the required information: extraction rules, which are generally regular expressions, are applied to a given document in order to extract entities or relations.
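The two subtasks can be sketched with regular expressions, using the protein example from the paragraph above. The protein lexicon, the relation phrases, and the function names are hypothetical; a real system would rely on a much larger lexicon and rule set.

```python
import re
from typing import List, Tuple

# Hypothetical mini-lexicon of protein names (illustration only).
PROTEINS = ["BRCA1", "RAD51", "TP53"]
NAME_ALT = "|".join(map(re.escape, PROTEINS))

# Entity rule: any lexicon entry appearing as a whole word.
ENTITY_RE = re.compile(rf"\b({NAME_ALT})\b")

# Relation rule: "<protein> interacts with <protein>" or "<protein> binds to <protein>".
RELATION_RE = re.compile(
    rf"\b(?P<a>{NAME_ALT})\b (?:interacts with|binds to) (?P<b>{NAME_ALT})\b"
)


def recognize_entities(text: str) -> List[Tuple[int, int, str]]:
    """Entity recognition: return (start, end, name) for each protein mention."""
    return [(m.start(), m.end(), m.group(1)) for m in ENTITY_RE.finditer(text)]


def extract_interactions(text: str) -> List[Tuple[str, str, str]]:
    """Relation extraction: return (protein, relation, protein) triples."""
    return [(m["a"], "interacts_with", m["b"]) for m in RELATION_RE.finditer(text)]


text = "BRCA1 interacts with RAD51. TP53 was also measured."
print(recognize_entities(text))
print(extract_interactions(text))
```

Entity recognition reports where each mention begins and ends, while relation extraction links two recognized entities; `TP53` is recognized as an entity but takes part in no relation in this sentence.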

A successful IE system relies, at least to some degree, on domain knowledge and some level of grammatical information. All the facts, relations, and implicit assumptions of the domain that are required to identify semantic entities and extract information from the text properly must be conveyed to the IE system. The success of a system correlates closely with how much of the required domain knowledge is made available to it as data sources. For general natural language texts, the domain knowledge is very complex and covers all of our world knowledge, and the required grammatical information is as complex as the whole grammar of that language. Medical narratives, on the other hand, are relatively easy to process from a grammatical point of view. Like many other technical subjects, medical texts use a narrower subset of the language with a limited number of information types [2], relatively unambiguous terminology [3], and predictable presentation patterns [4]. In other words, an information extraction system that targets a specific field such as medicine, with its specific domain knowledge and sublanguage, can be more successful than a general information extraction system because those texts are less ambiguous. Our information extraction system concentrates only on Turkish abdominal radiology reports, which are less ambiguous and require only limited domain knowledge.

There are two basic approaches to information extraction: a supervised methodology, also known as the Knowledge Engineering Approach, and an unsupervised (or semi-supervised) methodology referred to as the Automatic Training Approach [5]. In the supervised approach, extraction rules are developed manually by a domain expert or by a knowledge engineer in consultation with a domain expert. System performance therefore depends on the performance of the knowledge engineer and/or the domain expert. The main disadvantages of these systems are the difficulty of adapting them to another domain and the need for a domain expert to supply the domain knowledge. On the other hand, this approach is expected to perform better than the automatic training approach because human intelligence goes into the construction of the system parameters. The information extraction system described in this paper uses a supervised methodology, and its extraction rules and ontology were developed by a domain expert.

In the unsupervised approach, the IE system is trained on an annotated training data set using statistical approaches. For example, after entity names have been annotated manually, the text can be used to train the system for named entity recognition. During training, the system may interact with a user to check whether the extracted data is correct, so that it can fix its rules accordingly [5]. One of the major obstacles in IE is adapting a system to a new domain, since manual adaptation is costly: rule sets and templates must be recreated on the basis of the new domain. The difficulty of creating domain knowledge for a new domain is another limitation on performance. As a consequence of these problems, machine learning techniques are viable alternatives for information extraction, and they are discussed as a research topic in the field [6].

Traditionally, IE systems do not attempt a deep semantic analysis of all aspects of a text. They generally use pattern matching techniques such as finite state methods or regular expressions [7]. An ontology is a formal specification of a shared understanding of a domain of interest [8], and ontologies are increasingly used to share knowledge across systems. It has been claimed that using a formal ontology as one of the resources of an IE system improves the performance of entity recognition and semantic annotation tasks [9]. Several published systems use an ontology during the information extraction task [10], [11], [12], [13], [14].

Ontologies are increasingly popular for modeling knowledge in the medical domain. OpenGALEN is an initiative to create open source resources, including an ontology development environment and a large open source description logic-based ontology for the medical domain [15]. Rosse and Mejino published the Foundational Model of Anatomy (FMA), a reference ontology for anatomy [16]. Another medical ontology, RadLex, is derived from FMA and extends it to cover radiological anatomy [17], [18]. A related work, RadiO, was developed as a prototype application ontology to close the gap between radiology reports and RadLex [19].

Our IE system uses the ontology in both entity recognition and relation extraction. We use the ontology to determine not only the possible entities, attributes, and attribute values that appear in the radiology reports, but also the entities, attributes, and attribute values that are missing from the sentences of the reports. In other words, we use the ontology to extract semantic knowledge by disambiguating the sentences. Because the rules used in entity recognition and relation extraction contain ontological concepts, they have more expressive power than rules based on textual items alone.
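One way an ontology could help recover an entity that a sentence leaves implicit is sketched below. This is an illustrative assumption, not the mechanism used by the system in this paper: the toy `ONTOLOGY` mapping, the attribute names, and the fallback strategy (take the most recently mentioned entity that admits the attribute) are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Toy ontology (hypothetical): entity -> attributes it may carry.
ONTOLOGY: Dict[str, Set[str]] = {
    "liver": {"size", "echogenicity", "contour"},
    "gallbladder": {"wall_thickness", "size"},
    "kidney": {"size", "echogenicity", "cortical_thickness"},
}


def resolve_entity(
    attribute: str, mentioned: Optional[str], context: List[str]
) -> Optional[str]:
    """Decide which entity an attribute belongs to.

    Use the explicitly mentioned entity when the ontology allows the
    attribute for it; otherwise fall back to the most recently mentioned
    entity in `context` that admits the attribute. Return None when no
    candidate is consistent with the ontology.
    """
    if mentioned is not None and attribute in ONTOLOGY.get(mentioned, set()):
        return mentioned
    for entity in reversed(context):
        if attribute in ONTOLOGY.get(entity, set()):
            return entity
    return None


# A sentence such as "Wall thickness is normal." names no organ; the
# entities seen earlier in the report provide the candidates.
seen_so_far = ["liver", "gallbladder"]
print(resolve_entity("wall_thickness", None, seen_so_far))
print(resolve_entity("echogenicity", None, seen_so_far))
```

Because the lookup is phrased in terms of ontological concepts rather than surface strings, the same function covers every entity and attribute the ontology defines without a separate textual rule for each combination.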

In this paper, we present a prototype IE system for Turkish radiology reports. Our system is designed to process reports from different types of radiological examinations, such as ultrasonography, magnetic resonance imaging, computerized tomography, and plain X-rays, all of which are referred to as radiology reports in this paper. Although the prototype is designed to handle radiology reports from different sources, it has been tested with abdominal ultrasonography reports. Our IE system converts a complete report into a target relational information model. The Turkish radiological information extraction system (TRIES) uses rules as grammatical knowledge and the ontology as domain knowledge for both named entity recognition and semantic analysis. One of the main contributions of this paper is the use of an ontology in information extraction, which increases the expressive power of extraction rules and helps to determine missing items in sentences. Our system is the first information extraction system for Turkish texts. Since Turkish is a morphologically rich language, we use a morphological analyzer, and our extraction rules are also based on morphological features.

The rest of the paper is organized as follows. Section 2 discusses the related work in medical information extraction systems and ontology-based information extraction systems. In Section 3, we present the details of our ontology-based information extraction system. The performance results of our information extraction system are given in Section 4. We give the concluding remarks in Section 5.

Section snippets

Related work

After the initial introduction of information extraction approaches, the medical domain has become a popular application field for these systems. Many different research groups have emerged, mainly focusing on indexing reports as a free medical text search facility, automatic term coding such as diseases or physical findings, and detection of abnormal conditions such as disease findings. Recently, many medical IE systems have been developed using different approaches, and some of

Ontology based information extraction

TRIES is an information extraction system that aims to parse free-text Turkish radiological reports into computationally usable structured information. The major components of TRIES are given in Fig. 1. All the words in a given report are analyzed by a Turkish morphological analyzer. Each word is converted into a sequence consisting of a root word followed by possible morphemes. The morphological analyzer uses a lexicon, which is the source of lexical information for a set of Turkish root words. The

Evaluation

For the performance evaluation of TRIES, 100 radiology reports were randomly selected as unseen data. On average, each report is composed of 14.34 sentences and 105.43 words. The configuration of the system was frozen prior to analyzing the test set. A human domain expert is considered the gold standard, and the domain expert extracted the relations from these 100 reports. Then, the relations extracted by TRIES were compared against the relations extracted by the domain expert. Table 4

Conclusion

In this paper, we introduced TRIES, an information extraction system that uses an ontology as the domain knowledge for Turkish radiological reports. The ontology is the main source of domain knowledge in TRIES. It is referenced by the term analyzer in the named entity recognition phase, by the relation extractor in pattern matching, and by the target information model. TRIES uses the domain ontology to incorporate the knowledge of relevant concepts and their semantic relations into the system. TRIES

Summary

Free texts are still the main source of information in the medical domain, and they are widely used for both storage and exchange of information. Nevertheless, this form of information is not as useful as structured and coded data for decision making or for knowledge discovery related to public health, because the information in unstructured reports is computationally inaccessible. Since the access to the information in free texts requires extensive efforts, information extraction systems can

Conflict of interest statement

The authors declare that they have no conflict of interest with any person or organization.

Acknowledgment

We would like to thank Prof. Dr. Serdar Akyar, Prof. Dr. Mustafa Özmen, and Prof. Dr. Utku Şenol for their support and for granting us access to radiology reports during our research.

References (35)

  • U. Hahn et al., MEDSYNDIKATE a natural language system for the extraction of medical information from findings reports, International Journal of Medical Informatics (2002)
  • A. Mykowiecka et al., Rule-based information extraction from patients’ clinical data, Journal of Biomedical Informatics (2009)
  • J.M. Corrigan et al., Crossing the Quality Chasm: A New Health System for the 21st Century (2001)
  • N. Sager et al., Medical Language Processing: Computer Management of Narrative Data (1987)
  • A.M. Rassinoux, J.C. Wagner, C. Lovis, R.H. Baud, A. Rector, J.R. Scherrer, Analysis of medical texts based on a sound...
  • A.A. Archbold, D.A. Evans, On the Topical Structure of Medical Charts, in: Proceedings of the 13th Annual Symposium on...
  • D.E. Appelt, Introduction to information extraction, AI Communications (1999)
  • J. Turmo et al., Adaptive information extraction, ACM Computing Surveys (CSUR) (2006)
  • D.A. Evans, N.D. Brownlow, W.R. Hersh, E.M. Campbell, Automating concept identification in the electronic medical...
  • M. Uschold et al., Ontologies: principles, methods and applications, Knowledge Engineering Review (1996)
  • K. Bontcheva, H. Cunningham, A. Kiryakov, V. Tablan, Semantic annotation and human language technology, in: Semantic...
  • P. Buitelaar, P. Cimiano, S. Racioppa, M. Siegel, Ontology-based information extraction with soba, in: Proceedings of...
  • D.W. Embley, D.M. Campbell, R.D. Smith, S.W. Liddle, Ontology-based extraction and structuring of information from...
  • A. Maedche, G. Neumann, S. Staab, G. Saarbruecken, Bootstrapping an ontology-based information extraction system, in:...
  • H. Müller et al., Textpresso: an ontology-based information retrieval and extraction system for biological literature, PLoS Biology (2004)
  • A. Todirascu, L. Romary, D. Bekhouche, Vulcain—an ontology-based information extraction system, in: Proceedings of...
  • A.L. Rector, J.E. Rogers, P.E. Zanstra, E. van der Haring, OpenGALEN: open source medical terminology and tools, in:...