ABSTRACT
Evaluation of information retrieval systems with test collections makes use of a suite of fixed resources: a document corpus, a set of topics, and associated judgments of the relevance of each document to each topic. With large modern collections, exhaustive judging is not feasible. Instead, an approach called pooling is typically used, in which the documents to be judged are determined by, for example, taking the union of all documents appearing in the top positions of the answer lists generated by a range of systems. Conventionally, pooling uses system variations to provide diverse documents to be judged for a topic; different user queries are not considered. We explore the ramifications of user query variability for pooling, and demonstrate that conventional test collections do not cover this source of variation. The effect of user query variation on the size of the judging pool is just as strong as the effect of retrieval system variation. We conclude that user query variation should be incorporated early in test collection construction, and cannot be considered effectively post hoc.
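The depth-k pooling process described above reduces to a small set-union computation. The following Python sketch shows one way it might be implemented, assuming runs are stored as nested dictionaries keyed by run name and topic, with document lists in rank order; the names `runs`, `pool_for_topic`, and `POOL_DEPTH` are illustrative, not from the paper.

```python
# Minimal sketch of depth-k pooling (illustrative, not the paper's code).
# Assumed layout: {run_name: {topic_id: [doc_id, ...]}}, lists in rank order.

POOL_DEPTH = 100  # judging depth; TREC pools have commonly used depths like 100

def pool_for_topic(runs, topic_id, depth=POOL_DEPTH):
    """Union of the top-`depth` documents returned for `topic_id`
    across all contributing runs."""
    pool = set()
    for ranking_by_topic in runs.values():
        pool.update(ranking_by_topic.get(topic_id, [])[:depth])
    return pool

if __name__ == "__main__":
    # Two toy runs for one topic; with depth=2 the pool is {d1, d2, d4}.
    system_runs = {
        "bm25":  {"t1": ["d1", "d2", "d3"]},
        "lmdir": {"t1": ["d2", "d4", "d5"]},
    }
    print(len(pool_for_topic(system_runs, "t1", depth=2)))  # -> 3
```

Because the function is agnostic about what constitutes a "run", the same computation quantifies pool growth under either source of diversity: pass one run per retrieval system for conventional pooling, or one run per user query variant for the same topic, and compare the resulting pool sizes, which is the comparison the abstract draws.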