ABSTRACT
Existing commercial search engines often struggle to represent different perspectives of a search query. Argument retrieval systems address this limitation of search engines and provide both positive (PRO) and negative (CON) perspectives about a user's information need on a controversial topic (e.g., climate change). The effectiveness of such argument retrieval systems is typically evaluated based on topical relevance and argument quality, without taking into account the often differing number of documents shown for the argument stances (PRO or CON). Therefore, systems may retrieve relevant passages, but with a biased exposure of arguments. In this work, we analyze a range of non-stochastic fairness-aware ranking and diversity metrics to evaluate the extent to which argument stances are fairly exposed in argument retrieval systems.
Using the official runs of the argument retrieval task Touché at CLEF 2020, as well as synthetic data to control the amount and order of argument stances in the rankings, we show that systems with the best effectiveness in terms of topical relevance are not necessarily the most fair or the most diverse in terms of argument stance. The relationships we found between (un)fairness and diversity metrics shed light on how to evaluate group fairness -- in addition to topical relevance -- in argument retrieval settings.
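To make the notion of stance exposure concrete, the following is a minimal sketch of one way such a fairness-aware metric can be computed over a non-stochastic ranking of PRO/CON passages. The logarithmic position discount and the normalized disparity score are illustrative assumptions here, not the specific metrics analyzed in the paper; the function names are hypothetical.

```python
import math

def stance_exposure(ranking):
    """Sum position-discounted exposure (1 / log2(rank + 1)) per stance group."""
    exposure = {"PRO": 0.0, "CON": 0.0}
    for rank, stance in enumerate(ranking, start=1):
        exposure[stance] += 1.0 / math.log2(rank + 1)
    return exposure

def exposure_disparity(ranking):
    """Absolute difference in total exposure between PRO and CON,
    normalized by total exposure; 0.0 means perfectly balanced."""
    exp = stance_exposure(ranking)
    total = exp["PRO"] + exp["CON"]
    return abs(exp["PRO"] - exp["CON"]) / total if total else 0.0

# Two rankings with identical stance counts (and thus identical topical
# relevance, if all passages are relevant) can differ in fairness:
balanced = ["PRO", "CON", "PRO", "CON"]
blocked = ["PRO", "PRO", "CON", "CON"]
print(exposure_disparity(balanced) < exposure_disparity(blocked))  # True
```

The example illustrates the abstract's central point: because higher ranks receive disproportionately more exposure, two rankings with the same stance counts can be equally relevant yet unequally fair, so relevance-only evaluation misses this dimension.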