ABSTRACT
This paper presents results comparing user preference for search engine rankings with measures of effectiveness computed from a test collection. It establishes that preferences and evaluation measures correlate: systems measured as better on a test collection are preferred by users. This correlation holds both for "conventional web retrieval" and for retrieval that emphasizes diverse results. Of a selection of well-known measures, nDCG is found to correlate best with user preferences. Unlike previous studies in this area, this examination involved a large population of users, gathered through crowdsourcing, who were exposed to a wide range of retrieval systems, test collections, and search tasks. Reasons for user preferences were also gathered and analyzed. The work revealed a number of new results, but also showed that there is considerable scope for future work on refining effectiveness measures to better capture user preferences.
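Since the abstract singles out nDCG as the measure that best tracks user preferences, a brief sketch of the measure may help. The Python below shows one common formulation: graded relevance gains discounted by log rank, normalized by the score of the ideal ranking. This is an illustrative implementation with hypothetical function names, not the paper's own code, and the exponential-gain variant shown is only one of several gain/discount choices used in the literature.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain for a ranked list of graded
    relevance judgments (higher grade = more relevant)."""
    ranked = relevances[:k] if k else relevances
    # Exponential gain with a log2 rank discount -- one common variant.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(ranked))

def ndcg(relevances, k=None):
    """nDCG: DCG normalized by the DCG of the ideal (descending-grade)
    ranking, so scores fall in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Example: a system that ranks the highly relevant document first
# scores higher than one that buries it at the bottom.
print(ndcg([2, 0, 1], k=3))  # ~0.96
print(ndcg([0, 1, 2], k=3))  # ~0.59
```

Normalizing by the ideal DCG keeps scores in [0, 1], which makes per-query values comparable when averaging over the topics of a test collection.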