Research Article · DOI: 10.1145/3459637.3482428

Evaluating Relevance Judgments with Pairwise Discriminative Power

Published: 30 October 2021

ABSTRACT

Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, a metric for comparing relevance judgments collected under different annotation settings has become a necessity. Traditional metrics, such as κ, Krippendorff's α, and Φ, focus mainly on inter-assessor consistency to evaluate the quality of relevance judgments, and they run into the "reliable but useless" problem when used to compare different annotation settings (e.g., binary vs. 4-grade judgments). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across annotation settings and therefore suffer from limitations, such as requiring result ranking lists from many different systems. How to design an evaluation metric that compares relevance judgments under different grade settings thus needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates how well the relevance judgments can separate ranking lists of varying quality. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across various grade settings. Compared with existing metrics (e.g., Krippendorff's α, Φ, and DP), it provides reliable evaluation results with affordable additional annotation effort.
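The abstract does not spell out how PDP is computed; purely as a hypothetical illustration of the idea it describes (checking how well graded judgments reproduce independently collected document-level preference tests), the Python sketch below scores a judgment set by the fraction of preference pairs whose direction its grades reproduce. The function and variable names (pairwise_agreement, grades, preference_pairs) are invented for this example and are not from the paper.

def pairwise_agreement(grades, preference_pairs):
    """Fraction of document-level preference tests that a set of graded
    relevance judgments reproduces.

    grades: dict mapping doc_id -> relevance grade under some annotation setting.
    preference_pairs: list of (preferred_doc, other_doc) tuples, where
        preferred_doc was judged better in a preference test.
    Tied grades score half a point, since they cannot separate the pair.
    """
    if not preference_pairs:
        return 0.0
    score = 0.0
    for preferred, other in preference_pairs:
        if grades[preferred] > grades[other]:
            score += 1.0      # grades agree with the preference test
        elif grades[preferred] == grades[other]:
            score += 0.5      # grades cannot discriminate this pair
    return score / len(preference_pairs)


if __name__ == "__main__":
    # Toy data: the same four documents labelled under a binary and a 4-grade setting.
    binary = {"d1": 1, "d2": 1, "d3": 0, "d4": 0}
    graded = {"d1": 3, "d2": 1, "d3": 1, "d4": 0}
    prefs = [("d1", "d2"), ("d2", "d3"), ("d3", "d4"), ("d1", "d4")]
    print("binary :", pairwise_agreement(binary, prefs))   # 0.75  (two tied pairs)
    print("4-grade:", pairwise_agreement(graded, prefs))   # 0.875 (one tied pair)

In this toy run the finer-grained labels break more ties and so reproduce more of the preference tests, which is the kind of separation ability the abstract attributes to PDP; the paper itself should be consulted for the metric's formal definition.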


Published in

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, October 2021, 4966 pages. ISBN: 9781450384469. DOI: 10.1145/3459637.

Copyright © 2021 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.
