Testing a Cancer Meta Spider

https://doi.org/10.1016/S1071-5819(03)00118-6

Abstract

As in many other applications, the rapid proliferation and unrestricted Web-based publishing of health-related content have made finding pertinent and useful healthcare information increasingly difficult. Although the development of healthcare information retrieval systems such as medical search engines and peer-reviewed medical Web directories has helped alleviate this information and cognitive overload problem, the effectiveness of these systems has been limited by low search precision, poor presentation of search results, and the search effort required of users. To address these challenges, we have developed a domain-specific meta-search tool called Cancer Spider. By leveraging post-retrieval document clustering techniques, this system helps users query multiple medical data sources, gain an overview of the retrieved documents, and locate high-quality answers to a wide spectrum of health questions. The system presents the retrieved documents to users in two different views: (1) Web pages organized by a list of key phrases, and (2) Web pages clustered into regions discussing different topics on a two-dimensional map (self-organizing map). In this paper, we present the major components of the Cancer Spider system and a user study designed to evaluate the effectiveness and efficiency of our approach. Initial results comparing Cancer Spider with NLM Gateway, a premium medical search site, show that the two systems achieved comparable performance as measured by precision, recall, and F-measure. Cancer Spider required less user searching time, fewer documents to be browsed, and less user effort.

Introduction

In the healthcare profession, potentially significant decisions often depend on the availability of reliable and up-to-date information, yet health-related data, especially Web-based data, are highly distributed, of varying quality, and difficult to locate. For instance, clinical information is often mingled with non-clinical information, and consumer information is not distinguished from clinical or research information (Hersh, 1996). Documents with different amounts of technical detail and varying quality are often mixed together in an unstructured way, and it has become increasingly difficult to judge the quality and credibility of a piece of Web-based medical information. As a result, medical professionals and the general public increasingly experience the information and cognitive overload problem (Bowman et al., 1994) when seeking medical information.

The development of medical information retrieval (IR) systems such as medical search engines and peer-reviewed medical Web directories has helped alleviate this problem. However, the effectiveness and usefulness of these systems have been limited by low search precision and poor presentation of the retrieved documents. More specifically, the following should be noted:

  • Finding specific answers to user questions can be time-consuming and expensive, in part because of the amount of effort required to browse through large collections of returned documents and to identify the relevant ones. A search query in a general search engine like Google often returns thousands of results.

  • Traditional search engines, including medical search engines, present search results as ranked lists, ordered by estimated relevance to the query. A major drawback of this presentation is that it fails to give users a quick overall “feel” for the retrieved documents and often requires significant manual browsing effort from users to locate documents of interest.

To address these problems with existing medical IR systems, we have developed Cancer Spider, a meta-search engine that performs post-retrieval document clustering and semantics-based visualization. In the post-retrieval phase of the system operation, we apply a linguistic-based noun phrasing technique to extract key concepts from documents, aiming to improve search precision. Semantics-based visualization using the Self-Organizing Map algorithm enables users to easily summarize the subject areas covered by the retrieved documents and navigate among them. From the user's perspective, Cancer Spider allows a user to easily access multiple medical literature databases, gain an overview of the retrieved documents, and locate quality answers to a wide spectrum of health questions.
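To make the self-organizing map idea concrete, the following is a minimal sketch of how document vectors can be grouped into regions of a two-dimensional grid. It is an illustration under stated assumptions, not the Cancer Spider implementation: the function names, the grid size, the learning-rate and radius schedules, and the use of plain numeric vectors (for example, key-phrase weights per document) are all hypothetical choices made for this example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(weights, vector):
    """Return the grid node whose weight vector is closest to `vector`."""
    return min(weights, key=lambda node: euclidean(weights[node], vector))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rows x cols map (hypothetical parameters, fixed seed)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per grid node, initialised randomly in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # linear decay, an assumption
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)    # never shrink below the winner
        for vector in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                grid_dist = euclidean(node, bmu)
                if grid_dist <= radius:
                    # Nodes near the winner on the grid move more strongly.
                    influence = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                    for i in range(dim):
                        w[i] += lr * influence * (vector[i] - w[i])
    return weights


def map_documents(weights, vectors):
    """Group document indices by the grid node they fall on."""
    regions = {}
    for index, vector in enumerate(vectors):
        node = best_matching_unit(weights, vector)
        regions.setdefault(node, []).append(index)
    return regions
```

Calling `map_documents(train_som(vectors, 4, 4), vectors)` returns a dictionary from grid coordinates to the documents placed there; documents with similar vectors tend to land on the same or neighbouring nodes, which is what allows a map view to show topical regions.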

The rest of the paper is structured as follows. Section 2 surveys the current status of IR in the healthcare domain and some IR techniques, including meta-searching, document indexing, and post-retrieval clustering and visualization. In Section 3, we present the architectural design and major components of Cancer Spider. Section 4 discusses the design of a user study conducted to evaluate the proposed approach, and Section 5 reports and discusses the findings of this user study. We conclude the paper in Section 6 with a discussion of future research and system development.

Section snippets

Information retrieval in healthcare

Healthcare is an information-intensive business. Hersh (1996) classified textual health information into two main categories. The first category is patient-specific information. The second category is knowledge-based information, which can be further divided into the following three layers. Primary knowledge-based information contains original research reported in academic journals, books, technical reports, and other sources. Secondary knowledge-based information consists of indexes that

Cancer Spider system architecture

In this section, we present the architectural design of Cancer Spider, focusing on its use both as a medical meta-search tool and as a document clustering and visualization tool. Although the current implementation of Cancer Spider focuses on cancer-related topics, it can be easily adapted for other medical areas, since the core technologies are domain-independent.
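The meta-search role can be illustrated with a short sketch that sends one query to several sources in parallel and merges the results, treating trivially different URLs as the same document. This is a sketch under assumptions rather than the system's actual design: the result record shape (`url`, `title`), the source callables, and the URL normalization rule are hypothetical, and real sources would be wrappers around the individual medical databases.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def meta_search(query, sources):
    """Query every source and merge results, deduplicating by URL.

    `sources` maps a source name to a callable taking the query string and
    returning a list of dicts with at least a "url" key (hypothetical shape).
    """
    # Leaving the `with` block waits for all submitted searches to finish.
    with ThreadPoolExecutor(max_workers=max(len(sources), 1)) as pool:
        futures = {name: pool.submit(search, query)
                   for name, search in sources.items()}

    merged = {}
    for name, future in futures.items():
        try:
            results = future.result()
        except Exception:
            # One failing source should not sink the whole search.
            continue
        for result in results:
            key = normalize_url(result["url"])
            if key in merged:
                merged[key]["sources"].append(name)
            else:
                merged[key] = {**result, "sources": [name]}
    return list(merged.values())
```

Each merged record keeps a `sources` list, so a later processing step can tell which documents were returned by more than one database.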

Intended end-users of Cancer Spider are cancer researchers, physicians and medical librarians. The design goal of Cancer Spider is

Comparison base

We conducted a user study to evaluate the proposed approach of meta-searching, clustering, and visualization implemented in the Cancer Spider system. In our experiment, Cancer Spider was compared with NLM Gateway (http://gateway.nlm.nih.gov/gw/Cmd?GMBasicSearch), the portal search engine to NLM's multiple literature databases.

NLM Gateway is a Web-based system that lets users simultaneously search multiple retrieval systems at the US National Library of Medicine (NLM) through a unified Web

Performance

Measured by theme-based precision and recall, the performances of the two IR systems, Cancer Spider and NLM Gateway, were comparable. The main statistics are summarized in Table 1. Both the mean precision and the mean recall of Cancer Spider were comparable with those of NLM Gateway. On average, NLM Gateway seemed to be slightly better than Cancer Spider, but the difference was not statistically significant, as suggested by the p-value of the pair-wise t-test. The F-measure, which was the
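For readers unfamiliar with these measures, the sketch below computes precision, recall, and the balanced F-measure for one subject, together with a paired t statistic across subjects. It assumes the standard set-based definitions, with the themes a subject identified compared against an expert-judged list; the excerpt does not spell out the exact theme-based definitions used in the study, so the function names and inputs here are hypothetical.

```python
import math


def precision_recall_f(found, relevant):
    """Standard set-based precision, recall, and balanced F-measure.

    `found` is what a subject identified; `relevant` is the reference list
    (assumed here to come from expert judgment).
    """
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def paired_t_statistic(xs, ys):
    """t statistic for paired samples, e.g. per-subject scores on two systems.

    Converting t to a p-value needs the t distribution with n - 1 degrees
    of freedom, which is left to a statistics library.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(variance / n)
```

For example, a subject who lists four themes of which two appear in a three-theme reference list has precision 1/2, recall 2/3, and F-measure 4/7.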

Conclusions and future directions

Both Cancer Spider and NLM Gateway have strengths and weaknesses. NLM Gateway, as a portal site to many high-quality NLM databases, is a conventional medical search engine. Cancer Spider's data coverage is not as comprehensive as that of NLM Gateway. However, Cancer Spider draws its strength from post-retrieval processing and clustering capability.

With respect to system performance measured by theme-based precision and recall, Cancer Spider and NLM Gateway achieved comparable results. When

Acknowledgements

The Cancer Spider project is supported in part by the following research grants:

  • NSF Digital Library Initiative-2, “High-Performance Digital Library Systems: From Information to Knowledge Management,” IIS-9817473, April 1999–March 2002.

  • NIH/NLM, “UMLS Enhanced Dynamic Agents to Manage Medical Knowledge,” 1 R01 LM06919-1A1, February 2001–January 2004.

We would like to thank the researchers at the Arizona Cancer Center who participated in the user study and the librarians at the Arizona Health and

References (37)

  • Harman, D., 1991. How effective is suffixing? Journal of the American Society for Information Science.
  • Hearst, M., 1995. TileBars: visualization of term distribution information in full text information access. In:...
  • Hersh, W.R., 1996. Information Retrieval: A Health Care Perspective.
  • Howe, A.E., et al., 1997. SavvySearch: a meta-search engine that learns which search engines to query. AI Magazine.
  • Hull, D.A., 1996. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science.
  • Kiley, R., 1999. Medical Information on the Internet: A Guide for Health Professionals.
  • Koenemann, J., Belkin, N., 1996. A case for interaction: a study of interactive information retrieval behavior and...
  • Kohonen, T., 1995. Self-Organizing Maps.