Abstract
The aim of this study was to develop a secure, Google-based data-mining tool for radiology reports using free and open source technologies and to explore its use within an academic radiology department. A Health Insurance Portability and Accountability Act (HIPAA)-compliant data repository, search engine, and user interface were created to facilitate treatment, operations, and reviews preparatory to research. The Institutional Review Board waived review of the project, and informed consent was not required. A total of 2.9 million text reports, occupying 7.9 GB of disk space, were downloaded from our radiology information system to a fileserver. Extensible markup language (XML) representations of the reports were indexed using Google Desktop Enterprise search engine software. A hypertext markup language (HTML) form allowed users to submit queries to Google Desktop, and Google’s XML response was interpreted by a practical extraction and report language (PERL) script, which presented ranked results in a web browser window. The query, reason for search, results, and documents visited were logged to maintain HIPAA compliance. Indexing averaged approximately 25,000 reports per hour. A keyword search for a common term such as “pneumothorax” returned the ten most relevant of 705,550 total results in 1.36 s. A keyword search for a rare term such as “hemangioendothelioma” returned the ten most relevant of 167 total results in 0.23 s; retrieval of all 167 results took 0.26 s. Data mining tools for radiology reports will improve the productivity of academic radiologists in clinical, educational, research, and administrative tasks. By leveraging existing knowledge of Google’s interface, radiologists can quickly perform useful searches.
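The query-handling and audit-logging steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' PERL script: it is written in Python, and the XML layout (`results`, `result`, `title`, `url`, the `count` attribute), the function names, and the log fields are all hypothetical, because the abstract does not give the Google Desktop response schema or the log format.

```python
import io
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical response layout, invented for illustration only.
# The real Google Desktop XML schema is not described in the abstract.
SAMPLE_RESPONSE = """<results count="167">
  <result><title>CT abdomen</title><url>file:///reports/000123.xml</url></result>
  <result><title>MRI liver</title><url>file:///reports/000456.xml</url></result>
</results>"""


def parse_results(xml_text):
    """Return (total_hits, ranked_hits) parsed from the XML response text."""
    root = ET.fromstring(xml_text)
    total = int(root.get("count", "0"))
    hits = []
    # Document order in the response is taken as relevance order.
    for rank, node in enumerate(root.findall("result"), start=1):
        hits.append({
            "rank": rank,
            "title": node.findtext("title", default=""),
            "url": node.findtext("url", default=""),
        })
    return total, hits


def audit_record(user, query, reason, hits, visited=()):
    """Build one log entry: query, reason for search, results, documents visited."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "query": query,
        "reason": reason,
        "results": [hit["url"] for hit in hits],
        "visited": list(visited),
    }


def write_audit(log_file, record):
    """Append the record to an open text file as one JSON line."""
    log_file.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    total, hits = parse_results(SAMPLE_RESPONSE)
    record = audit_record("jdoe", "hemangioendothelioma",
                          "review preparatory to research", hits,
                          visited=[hits[0]["url"]])
    log = io.StringIO()  # stands in for an append-only log file
    write_audit(log, record)
    print(total, len(hits))
    print(log.getvalue().strip())
```

Writing one JSON line per search keeps the log append-only and easy to review later, which suits the accountability purpose the abstract describes; a production system would also need access control and protected storage for the log itself.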
Cite this article
Erinjeri, J.P., Picus, D., Prior, F.W. et al. Development of a Google-Based Search Engine for Data Mining Radiology Reports. J Digit Imaging 22, 348–356 (2009). https://doi.org/10.1007/s10278-008-9110-7
DOI: https://doi.org/10.1007/s10278-008-9110-7