Automatic performance evaluation of Web search engines

https://doi.org/10.1016/S0306-4573(03)00040-2

Abstract

Measuring the information retrieval effectiveness of World Wide Web search engines is costly because of the human relevance judgments involved. However, it is important for both businesses and individuals to know which Web search engines are most effective, since such search engines help their users find a larger number of relevant Web pages with less effort. Furthermore, this information can be used for several practical purposes. In this study we introduce an automatic Web search engine evaluation method as an efficient and effective assessment tool for such systems. Experiments based on eight Web search engines, 25 queries, and binary user relevance judgments show that our method provides results consistent with human-based evaluations. It is shown that the observed consistencies are statistically significant. This indicates that the new method can be successfully used in the evaluation of Web search engines.

Introduction

The growth of the World Wide Web is an unprecedented phenomenon. Four years after the Web's birth in 1990, a million or more copies of the first well-known Web browser, Mosaic, were in use (Abbate, 1999). This growth was a result of the exponential increase in the number of Web servers and in the value and number of Web pages made accessible by these servers. In 1999 the number of Web servers was estimated at about 3 million and the number of Web pages at about 800 million (Lawrence & Giles, 1999); three years later, in June 2002, the search engine AlltheWeb (alltheweb.com) announced that its index contained information about 2.1 billion Web pages. There are millions of Web users, and about 85% of them use search engines to locate information on the Web (Kobayashi & Takeda, 2000). Search engine use has been found to be the second most popular Internet activity after e-mail (Jansen & Pooch, 2001). Due to this high demand there are hundreds of general-purpose and thousands of specialized search engines (Kobayashi & Takeda, 2000; Lawrence & Giles, 1999).

People use search engines to find information on the Web. A Web search engine is an information retrieval system (Salton & McGill, 1983) that is used to locate the Web pages relevant to user queries (in this paper the terms page and document are used interchangeably). A Web search engine contains indexing, storage, query processing, spider (or crawler, robot), and user interface subsystems. The indexing subsystem aims to capture the information content of Web pages by using their words. During indexing, frequent words (that, the, this, etc.), known as stop words, may be eliminated since such words usually have little information value. Various statistics about words (e.g., the number of occurrences in individual pages or in all of the indexed Web pages) are usually stored in an inverted file structure. This organization is used during query processing to rank the pages according to their relevance scores for a given query. Hyperlink structure information about Web pages is also used for page ranking (Brin & Page, 1998; Kobayashi & Takeda, 2000). The spider subsystem brings the pages to be indexed into the system. However, for Web users a search engine is nothing but its user interface, which accepts queries and presents the search results. In this study our concern is text-based search engines. We also assume that users expect a selection of documents as their search result, rather than, say, a number (e.g., the freezing point of water) or a word (e.g., the name of the largest planet). This is the case in typical Web search engine queries (Spink, Wolfram, Jansen, & Saracevic, 2001).
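As a simple illustration of the indexing idea just described (not part of the paper; the tiny stop-word list and function name are ours), the following sketch builds a toy inverted file mapping each non-stop word to the pages containing it, together with within-page term frequencies:

```python
from collections import defaultdict

# A small, hypothetical stop-word list; real engines use much larger ones.
STOP_WORDS = {"that", "the", "this", "a", "an", "of", "is"}

def build_inverted_index(pages):
    """Map each non-stop word to {page_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for page_id, text in pages.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word][page_id] += 1
    return index

# Example: two tiny "Web pages".
pages = {
    "p1": "the freezing point of water is zero",
    "p2": "water covers most of the planet",
}
index = build_inverted_index(pages)
print(dict(index["water"]))  # {'p1': 1, 'p2': 1}
```

Such term statistics, possibly combined with hyperlink information, are what the query processing subsystem uses to score and rank pages.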

For assessing the performance of search engines there are various measures, such as database coverage, query response time, user effort, and retrieval effectiveness. The dynamic nature of the Web also raises additional performance concerns regarding index freshness and the availability of Web pages as time passes (Bar-Ilan, 2002). The most common effectiveness measures are precision (the ratio of retrieved relevant documents to the number of retrieved documents) and recall (the ratio of retrieved relevant documents to the total number of relevant documents in the database). Measuring search engine effectiveness is expensive due to the human labor involved in judging relevance. (For example, one of the subjects of our experiments spent about 6 hours judging the query results, and similar observations are reported in other studies (Hawking, Craswell, Bailey, & Griffiths, 2001).) Evaluation of search engines may need to be done often due to the changing needs of users and the dynamic nature of search engines (e.g., their changing Web coverage and ranking technology), and therefore it needs to be efficient.
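Stated compactly (with set notation introduced here only for illustration, not taken from the paper), these two measures are:

```latex
% A = set of retrieved documents, R = set of relevant documents
% in the database (notation ours).
\[
  \text{precision} = \frac{|R \cap A|}{|A|},
  \qquad
  \text{recall} = \frac{|R \cap A|}{|R|}.
\]
```

On the Web the full set R is unknowable, which is why the paper later resorts to a relative form of recall based on pooled results.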

This study is motivated by the fact that identifying the most effective Web search engines for current information-needs is important at both a personal and a business level. Among other uses, this information can be used for (a) finding more relevant documents with less effort; (b) motivating search engine providers to aim for higher standards; and (c) implementing custom-made meta-search engines. The contribution of this study is a method for the automatic performance evaluation of search engines. In the paper we first introduce the method, the automatic Web search engine evaluation method (AWSEEM). We then show that its results are consistent with human-based evaluations and that this consistency is statistically significant. This indicates that AWSEEM can be used instead of expensive human-based evaluations. AWSEEM uses an experimental approach to determining the effectiveness of Web search engines. For this purpose we first collect user information-needs and the associated queries, and then run these queries on several search engines. AWSEEM downloads and ranks the top 200 pages returned by each search engine for each information-need specified by the users. In the automatic method, a certain number of the most similar pages are assumed to be relevant (pseudo-relevance judgments). It has been shown experimentally that differences in human relevance assessments do not affect the relative performance of retrieval systems (Voorhees, 2000). Based on this observation, and treating these pseudo- (or automatic) relevance judgments as if they were another set of human relevance assessments, the performance of the search engines is evaluated. In the final stage the consistency of the automatic and human-based evaluations is measured by statistical methods.
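The core of this pseudo-relevance step can be sketched as follows. This is only an illustrative stand-in that ranks the pooled pages by a plain cosine similarity between raw term-count vectors of the information-need statement and each page; the actual ranking used by AWSEEM is described in Section 4, and the function names here are ours:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pseudo_relevant(need_text: str, pooled_pages: dict, s: int) -> set:
    """Return the ids of the s pooled pages most similar to the
    information-need statement (a stand-in for AWSEEM's ranking step)."""
    need_vec = Counter(need_text.lower().split())
    scored = [(cosine(need_vec, Counter(text.lower().split())), pid)
              for pid, text in pooled_pages.items()]
    scored.sort(reverse=True)
    return {pid for _, pid in scored[:s]}
```

The resulting set of top-s pages plays the role of a relevance judgment set, against which each individual engine's ranking is then scored.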

This paper is organized as follows. In Section 2 we give an overview of related work. We then describe the experimental environment, in terms of the queries and search engines involved, in Section 3, since this information makes the explanation of AWSEEM, provided in Section 4, more intuitive. Detailed experimental results and their statistical interpretation are provided in Section 5. Conclusions and pointers to future research are presented in Section 6.

Related work

The evaluation of text retrieval performance in static document collections is a well-known research problem in the field of information retrieval (Salton & McGill, 1983). In this study our concern is Web search engines. In the literature there are two types of search engine evaluation approaches: testimonials and shootouts. Testimonials are casual studies that state the general impression obtained after executing a few queries. Shootouts are rigorous studies that follow the information retrieval evaluation methodology.

User queries and information-needs

The process of measuring retrieval effectiveness requires user queries. For the construction of the queries we asked our participants, more technically known as subjects (the Fall 2001 students of the CS533 course and two professors in the Computer Engineering Department of Bilkent University), to define their information-needs in English. For this purpose, they stated their needs in a form by writing a query, query subject, query description, and related keywords. By this method we obtained 25 queries.

Automatic method AWSEEM

In AWSEEM, all user queries are first submitted to all search engines under consideration. Then, for each query, the top b (=200) Web pages returned by each search engine are determined. The contents of these pages, if the pages are available, are downloaded and saved to build a separate document collection, or Web abstraction, for each query. In this process dead links (i.e., unavailable pages) are regarded as useless answers. If a page is returned by more than one search engine, only one copy is kept.
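For concreteness, this pooling step might look like the following sketch. It is only an illustration, assuming the ranked result URLs of each engine are already available for the query; the function and parameter names are ours, not the paper's:

```python
import urllib.request

def build_query_pool(result_lists, b=200):
    """Pool the top-b URLs of each engine for one query.

    result_lists: {engine_name: [url1, url2, ...]} for a single query.
    Returns {url: page_text}; pages returned by several engines are
    stored only once, and unreachable pages (dead links) are skipped.
    """
    pool = {}
    for urls in result_lists.values():
        for url in urls[:b]:
            if url in pool:        # already fetched via another engine:
                continue           # keep only one copy
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    pool[url] = resp.read().decode("utf-8", errors="ignore")
            except (OSError, ValueError):
                pass               # dead link: treated as a useless answer
    return pool
```

The per-query pool produced this way serves as the "Web abstraction" against which both the human and the automatic relevance judgments are made.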

Evaluation measures and their calculation

The effectiveness of the search engines is measured in terms of precision and recall. This is done both for the human judgments and for AWSEEM, and the results are compared to test the consistency of the two approaches. In this study we measure "relative recall": in its calculation the denominator term, "the total number of relevant documents in the database", is replaced by "the total number of relevant documents retrieved by all search engines", as determined either by automatic judgment or by human judgment.
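The following toy computation (our own illustration, with made-up document ids) shows how precision and relative recall can be computed for one engine's ranking against such a pooled relevant set:

```python
def precision_at(retrieved, relevant, k):
    """Precision after examining the top-k retrieved documents."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def relative_recall(retrieved, relevant_pool, k):
    """Recall with the denominator replaced by the number of relevant
    documents retrieved by all engines (the pooled relevant set)."""
    hits = sum(1 for d in retrieved[:k] if d in relevant_pool)
    return hits / len(relevant_pool) if relevant_pool else 0.0

# One engine's ranking for a query, judged against the pool of relevant
# documents found across all engines (by humans or by AWSEEM).
ranking = ["d3", "d7", "d1", "d9", "d4"]
pooled_relevant = {"d1", "d3", "d8"}
print(precision_at(ranking, pooled_relevant, 5))     # 0.4
print(relative_recall(ranking, pooled_relevant, 5))  # 2/3
```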

Conclusions

In this study we present an automatic method for the performance evaluation of Web search engines. We measure the performance of the search engines after examining various numbers of top pages returned by them, and check the consistency between human and automatic evaluations using these observations. In the experiments we use 25 queries and examine the performance of eight different search engines based on binary user relevance judgments. Our experiments show a high level of consistency between the automatic and human-based evaluations.

Acknowledgements

We would like to thank the students of the Fall 2001 CS533 (Information Retrieval Systems) course and David Davenport of Bilkent University for writing and evaluating the queries used in this study. We would also like to thank our anonymous referees, and Jon M. Patton and Pedrito Uriah Maynard-Zhang, for their valuable comments.

1. The majority of this work was completed while the first author was on sabbatical leave at Bilkent University.