Development of a Google-Based Search Engine for Data Mining Radiology Reports

Journal of Digital Imaging

Abstract

The aim of this study is to develop a secure, Google-based data-mining tool for radiology reports using free and open source technologies and to explore its use within an academic radiology department. A Health Insurance Portability and Accountability Act (HIPAA)-compliant data repository, search engine, and user interface were created to facilitate treatment, operations, and reviews preparatory to research. The Institutional Review Board waived review of the project, and informed consent was not required. A total of 2.9 million text reports, occupying 7.9 GB of disk space, were downloaded from our radiology information system to a file server. Extensible markup language (XML) representations of the reports were indexed using Google Desktop Enterprise search engine software. A hypertext markup language (HTML) form allowed users to submit queries to Google Desktop, and Google’s XML response was interpreted by a Practical Extraction and Report Language (Perl) script, which presented ranked results in a web browser window. The query, reason for search, results, and documents visited were logged to maintain HIPAA compliance. Indexing averaged approximately 25,000 reports per hour. A keyword search for a common term such as “pneumothorax” returned the ten most relevant of 705,550 total results in 1.36 s. A keyword search for a rare term such as “hemangioendothelioma” returned the ten most relevant of 167 total results in 0.23 s; retrieving all 167 results took 0.26 s. Data-mining tools for radiology reports will improve the productivity of academic radiologists in clinical, educational, research, and administrative tasks. By leveraging existing knowledge of Google’s interface, radiologists can quickly perform useful searches.
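The query-handling step described in the abstract (interpret the search engine's XML response, present ranked results, and log the query together with the reason for the search) can be sketched roughly as follows. This is only an illustration in Python, not the authors' Perl script: the XML layout, the element names (`results`, `result`, `title`, `url`), the log format, the rule that an empty reason is rejected, and every function name are assumptions made for the example. The abstract does not describe Google Desktop's actual response schema or the real log structure.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical XML response. The real Google Desktop schema is not given
# in the abstract, so these element and attribute names are invented.
SAMPLE_RESPONSE = """<results count="167">
  <result><title>Report 1001</title><url>reports/1001.xml</url></result>
  <result><title>Report 2002</title><url>reports/2002.xml</url></result>
</results>"""


def parse_results(xml_text):
    """Return (total_hits, ranked_hits) from the assumed XML layout."""
    root = ET.fromstring(xml_text)
    total = int(root.get("count", "0"))
    hits = [
        {
            "rank": rank,  # position in the order the engine returned
            "title": node.findtext("title", default=""),
            "url": node.findtext("url", default=""),
        }
        for rank, node in enumerate(root.findall("result"), start=1)
    ]
    return total, hits


def audit_entry(user, query, reason, total, hits):
    """Build one JSON log line recording who searched for what, and why."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "query": query,
        "reason": reason,
        "total_results": total,
        "returned_urls": [hit["url"] for hit in hits],
    })


def search_and_log(xml_text, user, query, reason, log_lines):
    """Parse a response and append an audit record before returning hits."""
    # Assumption for this sketch: a search without a stated reason is refused.
    if not reason.strip():
        raise ValueError("a reason for the search is required")
    total, hits = parse_results(xml_text)
    log_lines.append(audit_entry(user, query, reason, total, hits))
    return total, hits


if __name__ == "__main__":
    log = []
    total, hits = search_and_log(
        SAMPLE_RESPONSE, "jdoe", "hemangioendothelioma", "teaching file", log
    )
    for hit in hits:
        print(hit["rank"], hit["title"], hit["url"])
    print(log[0])
```

Writing the audit record before the results are returned means that, in this sketch, no result list reaches the user without a matching log line. Logging of the individual documents visited, which the abstract also mentions, would need a similar record at the point where a report is opened.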

Author information

Corresponding author

Correspondence to Joseph P. Erinjeri.

About this article

Cite this article

Erinjeri, J.P., Picus, D., Prior, F.W. et al. Development of a Google-Based Search Engine for Data Mining Radiology Reports. J Digit Imaging 22, 348–356 (2009). https://doi.org/10.1007/s10278-008-9110-7

  • DOI: https://doi.org/10.1007/s10278-008-9110-7
