Research article (Honorable Mention)
DOI: 10.1145/2766462.2767760

The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation

Published: 09 August 2015

ABSTRACT

Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents in the context of information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting more than 50,000 magnitude estimation judgments. Our analysis shows that on average magnitude estimation judgments are rank-aligned with ordinal judgments made by expert relevance assessors. An advantage of magnitude estimation is that users can choose their own scale for judgments, allowing deeper investigation of user perceptions than categorical scales permit.
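Because each assessor picks an arbitrary scale (modulus), raw magnitude estimation scores are comparable only as ratios. The Python sketch below is not taken from the paper; it illustrates one common convention in magnitude estimation studies, geometric-averaging normalization, assumed here for illustration along with all function and variable names.

```python
import math

def normalize_magnitude_estimates(scores_by_assessor):
    """Place per-assessor magnitude estimation scores on a common scale.

    Each assessor picks their own modulus, so raw scores are meaningful only
    as ratios. Here (an assumed but common convention) each assessor's scores
    are divided by their geometric mean, and per-document scores are then
    combined across assessors by geometric averaging. Scores must be > 0.
    """
    rescaled = {}
    for assessor, scores in scores_by_assessor.items():
        logs = [math.log(v) for v in scores.values()]
        gmean = math.exp(sum(logs) / len(logs))
        rescaled[assessor] = {doc: v / gmean for doc, v in scores.items()}

    combined = {}
    for doc in {d for scores in rescaled.values() for d in scores}:
        vals = [scores[doc] for scores in rescaled.values() if doc in scores]
        combined[doc] = math.exp(sum(math.log(v) for v in vals) / len(vals))
    return combined

# Two assessors judging the same three documents on very different scales.
raw = {
    "assessor_1": {"d1": 10.0, "d2": 50.0, "d3": 100.0},
    "assessor_2": {"d1": 1.0, "d2": 4.0, "d3": 9.0},
}
print(normalize_magnitude_estimates(raw))
```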

We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ markedly. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance, in terms of how strongly relative differences in document relevance are perceived. We further use magnitude estimation to investigate gain profiles, comparing the currently assumed linear and exponential approaches with actual user-reported relevance perceptions. This indicates that the exponential gain profiles currently used in nDCG and ERR are mismatched with an average user, but perhaps more importantly that individual perceptions are highly variable. These results have direct implications for IR evaluation, suggesting that the current assumption that a single view of relevance is sufficient to represent a population of users is unlikely to hold. Finally, we demonstrate that magnitude estimation judgments can be reliably collected using crowdsourcing, and are competitive in terms of assessor cost.
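As a rough illustration of what calibrating gain-based metrics from user-reported relevance can involve, the sketch below computes nDCG and ERR with gains taken directly from (normalized) magnitude estimation scores. This is not the paper's calibration procedure: the log2 discount and the ERR cascade follow the standard metric definitions, but the mapping of a magnitude score g to an ERR stopping probability g / (1 + g), and all example numbers, are assumptions made for illustration only.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """nDCG: DCG of the observed ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def err(gains):
    """Expected reciprocal rank over a cascade of stopping probabilities.

    ERR needs a per-document probability in [0, 1); mapping a magnitude
    score g to g / (1 + g) is an assumption made for this sketch, standing
    in for the exponential mapping normally used with graded judgments.
    """
    score, p_continue = 0.0, 1.0
    for rank, g in enumerate(gains, start=1):
        r = g / (1.0 + g)           # assumed gain-to-probability mapping
        score += p_continue * r / rank
        p_continue *= 1.0 - r
    return score

# Gains taken directly from normalized magnitude estimation judgments for
# one ranked result list; the numbers are invented for illustration.
me_gains = [3.2, 0.4, 1.7, 0.1]
print(f"nDCG = {ndcg(me_gains):.3f}, ERR = {err(me_gains):.3f}")
```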


Published in

SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2015, 1198 pages
ISBN: 9781450336215
DOI: 10.1145/2766462

Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 9 August 2015


Acceptance Rates

SIGIR '15 paper acceptance rate: 70 of 351 submissions (20%). Overall SIGIR acceptance rate: 792 of 3,983 submissions (20%).
