ABSTRACT
Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, an evaluation metric that can compare relevance judgments across annotation settings has become a necessity. Traditional metrics, such as Cohen's κ, Krippendorff's α, and Φ, have mainly focused on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter the "reliable but useless" problem when employed to compare different annotation settings (e.g., binary judgment vs. 4-grade judgment). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across annotation settings and therefore suffer from limitations, such as requiring result ranking lists from multiple retrieval systems. How to design an evaluation metric that compares relevance judgments under different grade settings therefore needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates how well relevance judgments separate ranking lists of varying quality. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across grade settings. Compared with existing metrics (e.g., Krippendorff's α, Φ, and DP), it provides reliable evaluation results with affordable additional annotation effort.
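To make the intuition concrete, here is a minimal Python sketch of the idea the abstract describes: using a handful of document-level preference tests to probe how well a set of graded relevance labels discriminates between documents. The function `pdp_proxy`, its agreement-rate formulation, and the toy data are illustrative assumptions for exposition, not the paper's actual PDP estimator.

```python
# Sketch only: an assumed agreement-rate proxy for the discriminative
# ability of graded relevance labels, not the paper's PDP definition.
from typing import Dict, List, Tuple

def pdp_proxy(labels: Dict[str, int],
              preference_tests: List[Tuple[str, str, str]]) -> float:
    """Fraction of preference-tested document pairs that the graded
    relevance labels separate in the same direction as the preference.

    labels: doc_id -> graded relevance label under one annotation setting.
    preference_tests: (doc_a, doc_b, preferred) triples from side-by-side
        preference judgments, with preferred in {doc_a, doc_b}.
    """
    agree = 0
    for doc_a, doc_b, preferred in preference_tests:
        la, lb = labels[doc_a], labels[doc_b]
        if la == lb:
            continue  # tied labels cannot separate this pair
        if (doc_a if la > lb else doc_b) == preferred:
            agree += 1
    return agree / len(preference_tests) if preference_tests else 0.0

# Toy comparison of a binary and a 4-grade setting on the same tests.
tests = [("d1", "d2", "d1"), ("d2", "d3", "d2"), ("d1", "d3", "d1")]
binary = {"d1": 1, "d2": 1, "d3": 0}   # hypothetical binary labels
graded = {"d1": 3, "d2": 2, "d3": 0}   # hypothetical 4-grade labels
print(pdp_proxy(binary, tests))  # ~0.67: the tied pair (d1, d2) is lost
print(pdp_proxy(graded, tests))  # 1.00: finer grades separate all pairs
```

Under this toy proxy, a finer grade setting scores higher only when its extra grades actually agree with assessor preferences, which is why a preference-grounded measure can compare annotation settings where pure inter-assessor agreement cannot.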