ABSTRACT
Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in proportion to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents in the context of information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting more than 50,000 magnitude estimation judgments. Our analysis shows that, on average, magnitude estimation judgments are rank-aligned with ordinal judgments made by expert relevance assessors. An advantage of magnitude estimation is that users can choose their own scale for judgments, allowing deeper investigation of user perceptions than categorical scales permit.
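Because each observer chooses their own scale, raw magnitude estimates are not directly comparable across assessors. The abstract does not spell out a normalisation step, but geometric-mean normalisation is a standard approach in the magnitude estimation literature; the sketch below is a minimal illustration of that idea under this assumption, with hypothetical function and data names, not the paper's exact procedure.

```python
import math
from collections import defaultdict

def normalize_scores(judgments):
    """Map each assessor's magnitude estimates onto a common scale.

    `judgments` is a list of (assessor_id, document_id, score) triples,
    where each score is a positive magnitude estimate on that assessor's
    own scale. Dividing each assessor's scores by that assessor's
    geometric mean preserves relative differences while cancelling out
    the arbitrary choice of scale.
    """
    by_assessor = defaultdict(list)
    for assessor, _, score in judgments:
        by_assessor[assessor].append(score)

    # Geometric mean of each assessor's raw scores (scores must be > 0).
    geo_mean = {
        a: math.exp(sum(math.log(s) for s in scores) / len(scores))
        for a, scores in by_assessor.items()
    }

    return [(assessor, doc, score / geo_mean[assessor])
            for assessor, doc, score in judgments]
```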
We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top-ranked systems differ markedly between magnitude estimation and ordinal judgments. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance, specifically of how impactful relative differences in document relevance are. We further use magnitude estimation to investigate gain profiles, comparing the currently assumed linear and exponential approaches with actual user-reported relevance perceptions. This indicates that the exponential gain profiles currently used in nDCG and ERR are mismatched with an average user, but perhaps more importantly that individual perceptions are highly variable. These results have direct implications for IR evaluation, suggesting that the current assumption that a single view of relevance can represent a population of users is unlikely to hold. Finally, we demonstrate that magnitude estimation judgments can be reliably collected using crowdsourcing, and are competitive in terms of assessor cost.
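To illustrate how a gain-based metric can be driven by user-calibrated gains, the sketch below computes nDCG under the conventional exponential gain mapping, a linear mapping, and a vector of magnitude estimation scores used directly as gains. The ordinal labels, score values, and helper names are assumptions for illustration only, not the calibration performed in the study.

```python
import math

def dcg(gains, depth=10):
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:depth], start=1))

def ndcg(ranked_gains, depth=10):
    """nDCG: DCG of the ranking divided by DCG of the ideal reordering."""
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal, depth)
    return dcg(ranked_gains, depth) / denom if denom > 0 else 0.0

# Two conventional mappings from ordinal relevance levels to gain.
def exponential_gain(levels):   # graded gain as commonly used: 2^rel - 1
    return [2 ** r - 1 for r in levels]

def linear_gain(levels):        # linear alternative: gain equals the level
    return list(levels)

# Illustrative data for one ranked list: ordinal labels, and hypothetical
# per-document magnitude estimation scores (already normalised) that can
# be plugged in directly as gains instead of a fixed mapping.
levels = [3, 1, 0, 2, 1]
me_scores = [41.0, 6.5, 1.0, 18.0, 7.2]

print(ndcg(exponential_gain(levels)))
print(ndcg(linear_gain(levels)))
print(ndcg(me_scores))
```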