CC BY-NC-ND 4.0 · Methods Inf Med 2018; 57(S 01): e22-e29
DOI: 10.3414/ME17-02-0010
Focus Theme – Original Articles
Schattauer GmbH

Ad Hoc Information Extraction for Clinical Data Warehouses

Georg Dietrich (1), Jonathan Krebs (1), Georg Fette (1, 2), Maximilian Ertl (3), Mathias Kaspar (2), Stefan Störk (2), Frank Puppe (1)

1   Computer Science, University of Wuerzburg, Wuerzburg, Germany
2   Comprehensive Heart Failure Center (CHFC), University Hospital of Wuerzburg, Wuerzburg, Germany
3   Service Center Medical Informatics, University Hospital of Wuerzburg, Wuerzburg, Germany
This work was supported by the Comprehensive Heart Failure Center Würzburg (BMBF grants: #01EO1004 and #01EO1504).
Publication History

received: 28 July 2017

accepted: 10 February 2018

Publication Date: 25 May 2018 (online)

Summary

Background: Clinical Data Warehouses (CDW) reuse electronic health records (EHR) to make their data retrievable for research purposes or for patient recruitment for clinical trials. However, much information is hidden in unstructured data such as discharge letters. These texts can be preprocessed and converted to structured data via information extraction (IE), which is unfortunately a laborious task and therefore usually not available for most of the text data in a CDW.

Objectives: The goal of our work is to provide an ad hoc IE service that allows users to query text data in a manner similar to querying structured data in a CDW. While search engines just return text snippets, our system also returns frequencies (e.g. how many patients exist with “heart failure”, including textual synonyms, or how many patients have an LVEF < 45) based on the content of discharge letters or textual reports for special investigations such as echocardiography. Three subtasks are addressed: (1) to recognize and exclude negations and their scopes, (2) to extract concepts, i.e. Boolean values, and (3) to extract numerical values.

Methods: We implemented an extended version of the NegEx algorithm for German texts that detects negations and determines their scope. Furthermore, our document-oriented CDW PaDaWaN was extended with query functions, e.g. context-sensitive queries and regex queries, and with an extraction mode for computing the frequencies of Boolean and numerical values.
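
The pipeline described in the Methods (remove negated spans, then match concepts and numerical values on the remaining text, then count) can be illustrated with a minimal Python sketch. It is not the PaDaWaN implementation or the authors' extended NegEx: the trigger list, the scope rule (forward from the trigger to the next punctuation mark or "aber"/"jedoch"), and all function names are assumptions made for illustration only.

```python
import re

# Assumed, deliberately tiny trigger list; the real German NegEx lexicon is larger.
TRIGGER = re.compile(r"\b(?:kein(?:e|en|er|em|es)?|ohne)\b", re.IGNORECASE)
# Assumed scope rule: a negation ends at the next punctuation mark or contrast word.
SCOPE_END = re.compile(r"[.,;:]|\b(?:aber|jedoch)\b", re.IGNORECASE)
# A number with an optional decimal part (German texts often use a decimal comma).
NUMBER = r"(\d+(?:[.,]\d+)?)"


def strip_negated(text):
    """Return the text with every trigger and its forward scope removed."""
    kept, pos = [], 0
    while True:
        trigger = TRIGGER.search(text, pos)
        if trigger is None:
            kept.append(text[pos:])
            return "".join(kept)
        kept.append(text[pos:trigger.start()])
        end = SCOPE_END.search(text, trigger.end())
        pos = end.start() if end else len(text)


def has_concept(text, variants):
    """Boolean value: is any variant of the concept mentioned outside a negation?"""
    affirmed = strip_negated(text).lower()
    return any(variant.lower() in affirmed for variant in variants)


def extract_value(text, name):
    """Numerical value: first number within 15 non-digit characters after the name."""
    pattern = re.compile(re.escape(name) + r"\D{0,15}?" + NUMBER, re.IGNORECASE)
    match = pattern.search(strip_negated(text))
    return float(match.group(1).replace(",", ".")) if match else None


def count_documents(docs, predicate):
    """Frequency: number of documents for which the predicate holds."""
    return sum(1 for doc in docs if predicate(doc))


# Invented example snippets, not taken from the evaluation corpus.
docs = [
    "Kein Pleuraerguss, Herzinsuffizienz bekannt. LVEF 40 %.",
    "Keine Herzinsuffizienz. LVEF: 60%.",
    "Herzinsuffizienz, LVEF 35,5 %",
]


def lvef_below_45(doc):
    value = extract_value(doc, "LVEF")
    return value is not None and value < 45


print(count_documents(docs, lambda d: has_concept(d, ["Herzinsuffizienz"])))  # 2
print(count_documents(docs, lvef_below_45))  # 2
```

Because the concept variants and the value name are passed in at query time, nothing has to be extracted in advance, which is the property the ad hoc approach relies on. A production system would also need backward-scoped triggers (e.g. "nicht nachweisbar") and a far larger lexicon.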

Results: Evaluations on chest X-ray reports and discharge letters showed high F1-scores for the three subtasks: detection of negated concepts reached an F1-score of 0.99 in chest X-ray reports and 0.97 in discharge letters; extraction of Boolean values in chest X-ray reports reached about 0.99; and extraction of numerical values in chest X-ray reports and discharge letters also reached around 0.99, with the exception of the concept age.
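
The summary does not spell out how the F1-score is computed. Assuming the standard definition, with TP, FP and FN denoting true positives, false positives and false negatives, it is the harmonic mean of precision P and recall R:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```

Under this definition, a score near 0.99 requires both precision and recall to be close to 1, since the harmonic mean is pulled toward the smaller of the two.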

Discussion: The advantages of ad hoc IE over standard IE are the low development effort (just entering the concept with its variants), the promptness of the results and the adaptability by the user to his or her particular question. The disadvantages are usually lower accuracy and confidence.

This ad hoc information extraction approach is novel and goes beyond existing systems: Roogle [1] extracts predefined concepts from texts during preprocessing and makes them retrievable at runtime. Dr. Warehouse [2] applies negation detection and indexes the resulting subtexts, which contain the affirmed findings. Our approach combines negation detection with the extraction of concepts, but the extraction takes place at runtime rather than during preprocessing. This provides an ad hoc, dynamic, interactive and adjustable information extraction of arbitrary concepts, and even their values, on the fly.

Conclusions: We developed an ad hoc information extraction query feature for Boolean and numerical values within a CDW with high recall and precision based on a pipeline that detects and removes negations and their scope in clinical texts.

 
References

  • 1 Cuggia M, Garcelon N, Campillo-Gimenez B, Bernicot T, Laurent J-F, Garin E. et al. Roogle: an information retrieval engine for clinical data warehouse. Stud Health Technol Inform 2011; 169: 584-588.
  • 2 Garcelon N, Neuraz A, Benoit V, Salomon R, Burgun A. Improving a full-text search engine: the importance of negation detection and family history context to identify cases in a biomedical data warehouse. J Am Med Inform Assoc 2017; 24 (03) 607-613.
  • 3 Starlinger J, Kittner M, Blankenstein O, Leser U. How to improve information extraction from German medical records. it-Information Technology 2016; 58: 10.
  • 4 Krieger H-U, Spurk C, Uszkoreit H, Xu F, Zhang Y, Müller F. et al. Information Extraction from German Patient Records via Hybrid Parsing and Relation Extraction Strategies. In: Calzolari N (Conference Chair), Choukri K, Declerck T, Loftsson H, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S, et al. editors. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA); 2014: 2043-2048.
  • 5 Toepfer M, Corovic H, Fette G, Klügl P, Störk S, Puppe F. Fine-grained information extraction from German transthoracic echocardiography reports. BMC Med Inform Decis Mak 2015; 15: 91.
  • 6 Dorda W, Wrba T, Duftschmid G, Sachs P, Gall W, Rehnelt C. et al. ArchiMed: a medical information and retrieval system. Methods Inf Med 1999; 38 (01) 16-24.
  • 7 Gabetta M, Limongelli I, Rizzo E, Riva A, Segagni D, Bellazzi R. BigQ: a NoSQL based framework to handle genomic variants in i2b2. BMC Bioinformatics 2015; 16 (01) 415.
  • 8 Hu H, Correll M, Kvecher L, Osmond M, Clark J, Bekhash A. et al. DW4TR: a data warehouse for translational research. J Biomed Inform 2011; 44 (06) 1004-1019.
  • 9 Hussain S, Ouagne D, Sadou E, Dart T, Jaulent M-C, De Vloed B. et al. EHR4CR: A Semantic Web Based Interoperability Approach for Reusing Electronic Healthcare Records in Protocol Feasibility Studies. In: SWAT4LS. 2012
  • 10 Pennington JW, Ruth B, Italia MJ, Miller J, Wrazien S, Loutrel JG. et al. Harvest: an open platform for developing web-based biomedical data discovery and reporting applications. Journal of the American Medical Informatics Association 2013; 21 (02) 379-383.
  • 11 Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S. et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). Journal of the American Medical Informatics Association 2010; 17 (02) 124-130.
  • 12 Wolfe BA, Mamlin BW, Biondich PG, Fraser HS, Jazayeri D, Allen C. et al. The OpenMRS system: collaborating toward an open source EMR for developing countries. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association; 2006: 1146.
  • 13 Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 2009; 42 (02) 377-381.
  • 14 Lowe HJ, Ferris TA, Hernandez PM, Weber SC. STRIDE–An integrated standards-based translational research informatics platform. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association; 2009: 391.
  • 15 Canuel V, Rance B, Avillach P, Degoulet P, Burgun A. Translational research platforms integrating clinical and omics data: a review of publicly available solutions. Briefings in Bioinformatics 2014; 16 (02) 280-290.
  • 16 Danciu I, Cowan JD, Basford M, Wang X, Saip A, Osgood S. et al. Secondary use of clinical data: the Vanderbilt approach. Journal of Biomedical Informatics 2014; 52: 28-35.
  • 17 Dziuballe P, Forster C, Breil B, Thiemann V, Fritz F, Lechtenbörger J. et al. The single source architecture x4T to connect medical documentation and clinical research. Stud Health Technol Inform 2011; 169: 902-906.
  • 18 Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports. In: Proceedings of the AMIA Symposium. American Medical Informatics Association; 2001: 105.
  • 19 Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics 2001; 34 (05) 301-310.
  • 20 Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics 2009; 42 (05) 839-851.
  • 21 Skeppstedt M. Negation detection in Swedish clinical text: An adaption of NegEx to Swedish. Journal of Biomedical Semantics 2011; 02 (03) S3
  • 22 Deléger L, Grouin C. Detecting negation of medical problems in French clinical notes. In: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium. ACM; 2012: 697-702.
  • 23 Cotik V, Stricker V, Vivaldi J, Rodriguez H. Syntactic methods for negation detection in radiology reports in Spanish. ACL 2016 2016; 156.
  • 24 Costumero R, López F, Gonzalo-Martín C, Millan M, Menasalvas E. An approach to detect negation on medical documents in Spanish. In: International Conference on Brain Informatics and Health. Springer; 2014: 366-375.
  • 25 Afzal Z, Pons E, Kang N, Sturkenboom MC, Schuemie MJ, Kors JA. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinformatics 2014; 15 (01) 373.
  • 26 Chapman WW, Hilert D, Velupillai S, Kvist M, Skeppstedt M, Chapman BE. et al. Extending the NegEx lexicon for multiple languages. Stud Health Technol Inform 2013; 192: 677.
  • 27 Cotik V, Roller R, Xu F, Uszkoreit H, Budde K, Schmidt D. Negation Detection in Clinical Reports Written in German. BioTxtM 2016 2016; 115.
  • 28 Gros O, Stede M. Determining Negation Scope in German and English Medical Diagnoses. In: Taboada M, Trnavac R. editors. Nonveridicality and Evaluation: Theoretical, Computational and Corpus Approaches. (Studies in Pragmatics 11). Leiden/Boston: Brill; 2013: 113-126.
  • 29 Elkin PL, Brown SH, Bauer BA, Husser CS, Carruth W, Bergstrom LR. et al. A controlled trial of automated classification of negation from clinical notes. BMC Medical Informatics and Decision Making 2005; 05 (01) 13.
  • 30 Huang Y, Lowe HJ. A novel hybrid approach to automated negation detection in clinical radiology reports. Journal of the American Medical Informatics Association 2007; 14 (03) 304-311.
  • 31 Sohn S, Wu S, Chute CG. Dependency parser-based negation detection in clinical narratives. AMIA Jt Summits Transl Sci Proc 2012; 2012: 1-8.
  • 32 Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D. et al. Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PloS One 2014; 09 (11) e112774.
  • 33 Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J. et al. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. Journal of Biomedical Informatics 2015; 54: 213-219.
  • 34 Dietrich G, Fell F, Fette G, Krebs J, Ertl M, Kaspar M. et al. Web-PaDaWaN: Eine Web-basierte Benutzeroberfläche für ein klinisches Data Warehouse. In: HEC 2016, Joint Conference of GMDS, DGEpi, IEA-EEF, EFMI, DocAbstr 421, 2016. 2016
  • 35 Dietrich G, Ertl M, Fette G, Kaspar M, Krebs J, Mackenrodt D. et al. Extending the Query Language of a Data Warehouse for Patient Recruitment. Stud Health Technol Inform 2017; 243: 152.
  • 36 Krebs J, Corovic H, Dietrich G, Ertl M, Fette G, Kaspar M. et al. Semi-automatic Terminology Generation for Information Extraction from German Chest X-ray Reports. Stud Health Technol Inform 2017; 243: 80.
  • 37 Schmid H. Probabilistic part-of-speech tagging using decision trees. In: New methods in language processing. 2013: 154.