ABSTRACT
Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in proportion to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents in the context of information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting more than 50,000 magnitude estimation judgments. Our analysis shows that, on average, magnitude estimation judgments are rank-aligned with ordinal judgments made by expert relevance assessors. An advantage of magnitude estimation is that users can choose their own scale for judgments, allowing deeper investigation of user perceptions than categorical scales permit.
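Because each observer chooses their own scale, raw magnitude estimates are not directly comparable across assessors. The abstract does not spell out a normalisation step, but geometric-mean normalisation is a standard approach in the magnitude estimation literature; the sketch below is a minimal illustration of that idea under this assumption, with hypothetical function and data names, not the paper's exact procedure.

```python
import math
from collections import defaultdict

def normalize_scores(judgments):
    """Map each assessor's magnitude estimates onto a common scale.

    `judgments` is a list of (assessor_id, document_id, score) triples,
    where each score is a positive magnitude estimate on that assessor's
    own scale. Dividing each assessor's scores by that assessor's
    geometric mean preserves relative differences while cancelling out
    the arbitrary choice of scale.
    """
    by_assessor = defaultdict(list)
    for assessor, _, score in judgments:
        by_assessor[assessor].append(score)

    # Geometric mean of each assessor's raw scores (scores must be > 0).
    geo_mean = {
        a: math.exp(sum(math.log(s) for s in scores) / len(scores))
        for a, scores in by_assessor.items()
    }

    return [(assessor, doc, score / geo_mean[assessor])
            for assessor, doc, score in judgments]
```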
We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top-ranked systems differ markedly between magnitude estimation and ordinal judgments. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance, specifically of how impactful relative differences in document relevance are. We further use magnitude estimation to investigate gain profiles, comparing the currently assumed linear and exponential approaches with actual user-reported relevance perceptions. This indicates that the exponential gain profiles currently used in nDCG and ERR are mismatched with an average user, but perhaps more importantly that individual perceptions are highly variable. These results have direct implications for IR evaluation, suggesting that the current assumption that a single view of relevance can represent a population of users is unlikely to hold. Finally, we demonstrate that magnitude estimation judgments can be reliably collected using crowdsourcing, and are competitive in terms of assessor cost.
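To illustrate how a gain-based metric can be driven by user-calibrated gains, the sketch below computes nDCG under the conventional exponential gain mapping, a linear mapping, and a vector of magnitude estimation scores used directly as gains. The ordinal labels, score values, and helper names are assumptions for illustration only, not the calibration performed in the study.

```python
import math

def dcg(gains, depth=10):
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:depth], start=1))

def ndcg(ranked_gains, depth=10):
    """nDCG: DCG of the ranking divided by DCG of the ideal reordering."""
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal, depth)
    return dcg(ranked_gains, depth) / denom if denom > 0 else 0.0

# Two conventional mappings from ordinal relevance levels to gain.
def exponential_gain(levels):   # graded gain as commonly used: 2^rel - 1
    return [2 ** r - 1 for r in levels]

def linear_gain(levels):        # linear alternative: gain equals the level
    return list(levels)

# Illustrative data for one ranked list: ordinal labels, and hypothetical
# per-document magnitude estimation scores (already normalised) that can
# be plugged in directly as gains instead of a fixed mapping.
levels = [3, 1, 0, 2, 1]
me_scores = [41.0, 6.5, 1.0, 18.0, 7.2]

print(ndcg(exponential_gain(levels)))
print(ndcg(linear_gain(levels)))
print(ndcg(me_scores))
```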