ABSTRACT
Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, an evaluation metric that can compare relevance judgments across annotation settings has become a necessity. Traditional metrics, such as Cohen's κ, Krippendorff's α, and Φ, have mainly focused on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter the "reliable but useless" problem when employed to compare different annotation settings (e.g., binary judgment vs. 4-grade judgment). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across annotation settings and therefore suffer from limitations, such as requiring result ranking lists from multiple retrieval systems. How to design an evaluation metric that compares relevance judgments under different grade settings therefore needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates how well relevance judgments separate ranking lists of varying quality. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across grade settings. Compared with existing metrics (e.g., Krippendorff's α, Φ, and DP), it provides reliable evaluation results with affordable additional annotation effort.
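To make the intuition concrete, here is a minimal Python sketch of the idea the abstract describes: using a handful of document-level preference tests to probe how well a set of graded relevance labels discriminates between documents. The function `pdp_proxy`, its agreement-rate formulation, and the toy data are illustrative assumptions for exposition, not the paper's actual PDP estimator.

```python
# Sketch only: an assumed agreement-rate proxy for the discriminative
# ability of graded relevance labels, not the paper's PDP definition.
from typing import Dict, List, Tuple

def pdp_proxy(labels: Dict[str, int],
              preference_tests: List[Tuple[str, str, str]]) -> float:
    """Fraction of preference-tested document pairs that the graded
    relevance labels separate in the same direction as the preference.

    labels: doc_id -> graded relevance label under one annotation setting.
    preference_tests: (doc_a, doc_b, preferred) triples from side-by-side
        preference judgments, with preferred in {doc_a, doc_b}.
    """
    agree = 0
    for doc_a, doc_b, preferred in preference_tests:
        la, lb = labels[doc_a], labels[doc_b]
        if la == lb:
            continue  # tied labels cannot separate this pair
        if (doc_a if la > lb else doc_b) == preferred:
            agree += 1
    return agree / len(preference_tests) if preference_tests else 0.0

# Toy comparison of a binary and a 4-grade setting on the same tests.
tests = [("d1", "d2", "d1"), ("d2", "d3", "d2"), ("d1", "d3", "d1")]
binary = {"d1": 1, "d2": 1, "d3": 0}   # hypothetical binary labels
graded = {"d1": 3, "d2": 2, "d3": 0}   # hypothetical 4-grade labels
print(pdp_proxy(binary, tests))  # ~0.67: the tied pair (d1, d2) is lost
print(pdp_proxy(graded, tests))  # 1.00: finer grades separate all pairs
```

Under this toy proxy, a finer grade setting scores higher only when its extra grades actually agree with assessor preferences, which is why a preference-grounded measure can compare annotation settings where pure inter-assessor agreement cannot.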