ABSTRACT
A common practice in comparative evaluation of information retrieval (IR) systems is to create a test collection comprising a set of topics (queries), a document corpus, and relevance judgments, and to monitor the performance of retrieval systems over such a collection. A typical evaluation of a system involves computing a performance metric, e.g., Average Precision (AP), for each topic and then using the average of that metric, e.g., Mean Average Precision (MAP), to express overall system performance. However, averages do not capture all the important aspects of system performance and, used alone, may not thoroughly express system effectiveness; in particular, an average can mask large variance in individual topic effectiveness. Our hypothesis is that, in addition to the overall average performance, attention needs to be paid to how a system's performance varies across topics. This variability can be measured by calculating the standard deviation (SD) of the individual per-topic scores. We refer to this performance variation as Volatility.
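As a rough illustration of the two quantities described above, the following sketch computes MAP and the SD-based volatility for one system from per-topic AP scores. The scores are hypothetical and the use of the sample standard deviation is an assumption, not a detail taken from the paper.

```python
# Minimal sketch: MAP and "volatility" (SD of per-topic AP) for one system.
# The AP values below are hypothetical, for illustration only.
import statistics

ap_scores = [0.62, 0.15, 0.88, 0.40, 0.05, 0.71, 0.33, 0.90, 0.12, 0.55]

map_score = statistics.mean(ap_scores)    # Mean Average Precision (MAP)
volatility = statistics.stdev(ap_scores)  # sample SD of per-topic AP (assumed definition)

print(f"MAP = {map_score:.3f}, Volatility (SD of AP) = {volatility:.3f}")
```

Two systems with the same MAP can differ sharply in this volatility figure; the lower-SD system delivers more consistent effectiveness across topics.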