Testing a Cancer Meta Spider

doi:10.1016/S1071-5819(03)00118-6

International Journal of Human-Computer Studies

Volume 59, Issue 5, November 2003, Pages 755-776

Abstract

As in many other applications, the rapid proliferation and unrestricted Web-based publishing of health-related content have made finding pertinent and useful healthcare information increasingly difficult. Although the development of healthcare information retrieval systems such as medical search engines and peer-reviewed medical Web directories has helped alleviate this information and cognitive overload problem, the effectiveness of these systems has been limited by low search precision, poor presentation of search results, and the required user search effort. To address these challenges, we have developed a domain-specific meta-search tool called Cancer Spider. By leveraging post-retrieval document clustering techniques, this system aids users in querying multiple medical data sources to gain an overview of the retrieved documents and locating answers of high quality to a wide spectrum of health questions. The system presents the retrieved documents to users in two different views: (1) Web pages organized by a list of key phrases, and (2) Web pages clustered into regions discussing different topics on a two-dimensional map (self-organizing map). In this paper, we present the major components of the Cancer Spider system and a user evaluation study designed to evaluate the effectiveness and efficiency of our approach. Initial results comparing Cancer Spider with NLM Gateway, a premium medical search site, have shown that they achieved comparable performances measured by precision, recall, and F-measure. Cancer Spider required less user searching time, fewer documents that need to be browsed, and less user effort.

Introduction

In the healthcare profession, potentially significant decisions often depend on the availability of reliable and up-to-date information, although health-related data, especially those that are Web-based, are highly distributed, of varying quality, and difficult to locate. For instance, clinical information is often mingled with non-clinical information, and consumer information is undistinguished from clinical or research information (Hersh, 1996). Documents with different amounts of technical detail and varying quality are often mixed together in an unstructured way and it has become increasingly difficult to judge the quality and credibility of a piece of Web-based medical information. As a result, medical professionals and the general public increasingly experience the information and cognitive overload problem (Bowman et al., 1994) when seeking medical information.

The development of medical information retrieval (IR) systems such as medical search engines and peer-reviewed medical Web directories has helped alleviate this problem. However, the effectiveness and usefulness of these systems have been limited by low search precision and poor presentation of the retrieved documents. More specifically, the following should be noted.

• Finding specific answers to user questions can be time-consuming and expensive, in part because of the amount of effort required to browse through large collections of returned documents and to identify the relevant ones. A search query in a general search engine like Google often returns thousands of results.
• Traditional search engines, including medical search engines, present search results as ranked lists, ordered by estimated relevance to the query. A major drawback of this presentation is that it fails to give users a quick overall “feel” for the retrieved documents and often requires significant manual browsing effort from users to locate documents of interest.

To address these problems with existing medical IR systems, we have developed Cancer Spider, a meta-search engine that performs post-retrieval document clustering and semantics-based visualization. In the post-retrieval phase of the system operation, we apply a linguistic-based noun phrasing technique to extract key concepts from documents, aiming to improve search precision. Semantics-based visualization using the Self-Organizing Map algorithm enables users to easily summarize the subject areas covered by the retrieved documents and navigate among them. From the user's perspective, Cancer Spider allows a user to easily access multiple medical literature databases, gain an overview of the retrieved documents, and locate quality answers to a wide spectrum of health questions.
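To make the self-organizing map idea concrete, the following is a minimal sketch of how documents, once reduced to numeric vectors (for example, weights of extracted key phrases), can be placed into regions of a two-dimensional grid. It is not the Cancer Spider implementation: the function names, the grid size, the linear decay schedules, and the toy vectors are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(weights, key=lambda node: math.dist(weights[node], vector))


def train_som(vectors, rows=3, cols=3, epochs=40, seed=0):
    """Train a small SOM and return {(row, col): weight_vector}."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same map
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    start_radius = max(rows, cols) / 2.0
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        rate = 0.5 * remaining                       # learning rate decays linearly
        radius = max(start_radius * remaining, 0.5)  # neighbourhood shrinks over time
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for node, weight in weights.items():
                grid_distance = math.dist(node, bmu)
                if grid_distance > radius:
                    continue  # node lies outside the current neighbourhood
                pull = rate * math.exp(-grid_distance ** 2 / (2 * radius ** 2))
                for i in range(dim):
                    weight[i] += pull * (vector[i] - weight[i])
    return weights


def map_documents(weights, documents):
    """Group document ids by the map region (grid node) they fall into."""
    regions = {}
    for doc_id, vector in documents.items():
        node = best_matching_unit(weights, vector)
        regions.setdefault(node, []).append(doc_id)
    return regions


# Toy usage: three hypothetical documents described by three key-phrase weights.
docs = {
    "doc_treatment_1": [1.0, 0.0, 0.0],
    "doc_treatment_2": [0.9, 0.1, 0.0],
    "doc_screening_1": [0.0, 0.0, 1.0],
}
som = train_som(list(docs.values()))
print(map_documents(som, docs))
```

Each grid node ends up representing a region of the document space, and neighbouring nodes are pulled toward similar documents during training, which is what lets a map of this kind be browsed by topic area.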

The rest of the paper is structured as follows. Section 2 surveys the current status of IR in the healthcare domain and some IR techniques, including meta-searching, document indexing, and post-retrieval clustering and visualization. In Section 3, we present the architectural design and major components of Cancer Spider. Section 4 discusses the design of a user study conducted to evaluate the proposed approach, and Section 5 reports and discusses the findings of this user study. We conclude the paper in Section 6 with a discussion of future research and system development.

Section snippets

Information retrieval in healthcare

Healthcare is an information-intensive business. Hersh (1996) classified textual health information into two main categories. The first category is patient-specific information. The second category is knowledge-based information, which can be further divided into the following three layers. Primary knowledge-based information contains original research reported in academic journals, books, technical reports, and other sources. Secondary knowledge-based information consists of indexes that

Cancer Spider system architecture

In this section, we present the architectural design of Cancer Spider, focusing both on its use as a meta medical search tool and as a document clustering and visualization tool. Although the current implementation of Cancer Spider focuses on cancer-related topics, it can be easily adapted for other medical areas, since the core technologies are domain-independent.
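As a rough illustration of the meta-search step, the sketch below sends one query to several sources and merges the hits into a single de-duplicated list. The source interface (a callable returning dictionaries with "url" and "title" keys), the trailing-slash URL normalization, and the rank-by-number-of-sources rule are assumptions made for this example, not details taken from the Cancer Spider design.

```python
def meta_search(query, sources):
    """Query every source and merge hits that point to the same URL.

    `sources` maps a source name to a callable that takes the query string
    and returns a list of {"url": ..., "title": ...} dictionaries.
    """
    merged = {}
    for name, search in sources.items():
        for hit in search(query):
            key = hit["url"].rstrip("/")  # treat ".../page" and ".../page/" as one document
            entry = merged.setdefault(
                key, {"url": hit["url"], "title": hit["title"], "sources": []}
            )
            entry["sources"].append(name)
    # Documents returned by more sources come first; the sort is stable,
    # so ties keep the order in which documents were first seen.
    return sorted(merged.values(), key=lambda entry: -len(entry["sources"]))


# Toy usage with two stub sources standing in for real medical databases.
def stub_source_one(query):
    return [{"url": "http://example.org/a", "title": "A"},
            {"url": "http://example.org/b/", "title": "B"}]


def stub_source_two(query):
    return [{"url": "http://example.org/b", "title": "B"},
            {"url": "http://example.org/c", "title": "C"}]


results = meta_search("lung cancer", {"one": stub_source_one, "two": stub_source_two})
for entry in results:
    print(entry["title"], entry["sources"])
```

The merged list is the natural input for post-retrieval steps such as key-phrase extraction and clustering, since each document appears only once regardless of how many sources returned it.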

Intended end-users of Cancer Spider are cancer researchers, physicians and medical librarians. The design goal of Cancer Spider is

Comparison base

We conducted a user study to evaluate the proposed approach of meta-searching, clustering, and visualization implemented in the Cancer Spider system. In our experiment, Cancer Spider was compared with NLM Gateway (http://gateway.nlm.nih.gov/gw/Cmd?GMBasicSearch), the portal search engine to NLM's multiple literature databases.

NLM Gateway is a Web-based system that lets users simultaneously search multiple retrieval systems at the US National Library of Medicine (NLM) through a unified Web

Performance

Measured by theme-based precision and recall, the performances of the two IR systems, Cancer Spider and NLM Gateway, were comparable. The main statistics are summarized in Table 1. Both the mean precision and the mean recall of Cancer Spider were comparable with those of NLM Gateway. On average, NLM Gateway seemed to be slightly better than Cancer Spider, but the difference was not statistically significant, as suggested by the p-value of the pair-wise t-test. The F-measure, which was the
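For readers unfamiliar with these measures, the sketch below computes precision, recall, and a balanced F-measure for a single search task, treating the themes a subject identified and the themes judged relevant as plain sets. The set-based definitions and the balanced (harmonic mean) form of the F-measure are standard textbook versions assumed here; the exact theme-based definitions used in the study are not reproduced in this excerpt, and the theme labels are invented.

```python
def precision_recall_f(found, relevant):
    """Return (precision, recall, F-measure) for one search task.

    `found` is the collection of themes a subject identified;
    `relevant` is the collection of themes judged correct for the task.
    """
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Toy usage with made-up theme labels.
identified = ["chemotherapy", "radiation", "diet", "surgery"]
judged_relevant = ["chemotherapy", "radiation", "immunotherapy"]
print(precision_recall_f(identified, judged_relevant))
```

Per-task scores of this kind, collected for the same subjects on both systems, are the sort of paired data to which a pair-wise t-test can then be applied.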

Conclusions and future directions

Both Cancer Spider and NLM Gateway have strengths and weaknesses. NLM Gateway, as a portal site to many high-quality NLM databases, is a conventional medical search engine. Cancer Spider's data coverage is not as comprehensive as that of NLM Gateway. However, Cancer Spider draws its strength from post-retrieval processing and clustering capability.

With respect to system performance measured by theme-based precision and recall, Cancer Spider and NLM Gateway achieved comparable results. When

Acknowledgements

The Cancer Spider project is supported in part by the following research grants:

• NSF Digital Library Initiative-2, “High-Performance Digital Library Systems: From Information to Knowledge Management,” IIS-9817473, April 1999–March 2002.
• NIH/NLM, “UMLS Enhanced Dynamic Agents to Manage Medical Knowledge,” 1 R01 LM06919-1A1, February 2001–January 2004.

We would like to thank the researchers at the Arizona Cancer Center who participated in the user study and the librarians at the Arizona Health and

References (37)

Bin, L., et al., 2001. The retrieval effectiveness of medical information on the Web. International Journal of Medical Informatics.
Chen, H., et al., 1996. Internet categorization and search: a self-organizing approach. Journal of Visual Communication and Image Representation.
Shneiderman, B., 1997. Designing information-abundant Web sites: issues and recommendations. International Journal of Human–Computer Studies.
Sutcliffe, A.G., et al., 2000. Evaluating the effectiveness of visual user interfaces for information retrieval. International Journal of Human–Computer Studies.
Bowman, C., et al., 1994. Scalable Internet resource discovery: research problems and approaches. Communications of the ACM.
Brill, E., 1995. Transformation-based error-driven learning and natural language processing. Computational Linguistics.
Chau, M., Zeng, D., Chen, H., 2001. Personalized spiders for Web search and analysis. In: Proceedings of the First...
Chen, H., et al., 1998. Internet browsing and searching: user evaluations of category map and concept space techniques. Journal of the American Society for Information Science.
Chen, H., et al., 2001. MetaSpider: meta-searching and categorization on the Web. Journal of the American Society for Information Science & Technology.
Gauch, S., et al., 1996. Profusion: intelligent fusion from multiple different search engines. Journal of Universal Computer Science.
Harman, D., 1991. How effective is suffixing? Journal of the American Society for Information Science.
Hearst, M., 1995. TileBars: visualization of term distribution information in full text information access. In:...
Hersh, W.R., 1996. Information Retrieval: A Health Care Perspective.
Howe, A.E., et al., 1997. SavvySearch: a meta-search engine that learns which search engines to query. AI Magazine.
Hull, D.A., 1996. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science.
Kiley, R., 1999. Medical Information on the Internet: A Guide for Health Professionals.
Koenemann, J., Belkin, N., 1996. A case for interaction: a study of interactive information retrieval behavior and...
Kohonen, T., 1995. Self-Organizing Maps.

