ABSTRACT
This paper presents results comparing user preference for search engine rankings with measures of effectiveness computed from a test collection. It establishes that preferences and evaluation measures correlate: systems measured as better on a test collection are preferred by users. This correlation holds both for "conventional web retrieval" and for retrieval that emphasizes diverse results. Of a selection of well-known measures, nDCG is found to correlate best with user preferences. Unlike previous studies in this area, this examination involved a large population of users, gathered through crowdsourcing, who were exposed to a wide range of retrieval systems, test collections, and search tasks. Reasons for user preferences were also gathered and analyzed. The work revealed a number of new results, but also showed that there is considerable scope for future work on refining effectiveness measures to better capture user preferences.
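Since the abstract singles out nDCG as the measure that best tracks user preferences, a brief sketch of the measure may help. The Python below shows one common formulation: graded relevance gains discounted by log rank, normalized by the score of the ideal ranking. This is an illustrative implementation with hypothetical function names, not the paper's own code, and the exponential-gain variant shown is only one of several gain/discount choices used in the literature.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain for a ranked list of graded
    relevance judgments (higher grade = more relevant)."""
    ranked = relevances[:k] if k else relevances
    # Exponential gain with a log2 rank discount -- one common variant.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(ranked))

def ndcg(relevances, k=None):
    """nDCG: DCG normalized by the DCG of the ideal (descending-grade)
    ranking, so scores fall in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Example: a system that ranks the highly relevant document first
# scores higher than one that buries it at the bottom.
print(ndcg([2, 0, 1], k=3))  # ~0.96
print(ndcg([0, 1, 2], k=3))  # ~0.59
```

Normalizing by the ideal DCG keeps scores in [0, 1], which makes per-query values comparable when averaging over the topics of a test collection.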