ABSTRACT
Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid the foundations for investigating the relationship between retrieval effectiveness and retrieval bias. While various factors influencing bias have been examined, no work has examined the impact of using bigrams within the index on retrieval bias. Intuitively, how documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the bias of a system changes depending on whether documents are represented using unigrams, bigrams, or both. Our analysis of three different retrieval models on three TREC collections shows that a bigram-only representation results in the lowest bias compared to a unigram-only representation, but at the expense of retrieval effectiveness. However, combining both representations reduces the overall bias while also increasing effectiveness. These findings suggest that, when configuring and indexing a collection, the bag-of-words approach (unigrams) should be augmented with bigrams to create better and fairer retrieval systems.
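To make the three document representations concrete, the following is a minimal Python sketch, not the authors' implementation. The function names (`index_terms`, `gini`), the whitespace tokenizer, and the underscore-joined bigram format are all hypothetical choices made for illustration. The `gini` helper assumes that bias is summarized as the Gini coefficient over per-document retrievability scores, which is a common convention in retrievability studies but is not stated in the abstract.

```python
from typing import List, Sequence


def bigrams(tokens: Sequence[str]) -> List[str]:
    """Join each pair of adjacent tokens into a single index term."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def index_terms(text: str, mode: str = "both") -> List[str]:
    """Return the index terms of a document under one representation.

    mode is "unigram", "bigram", or "both" (hypothetical option names).
    Tokenization is naive lowercasing plus whitespace splitting.
    """
    tokens = text.lower().split()
    if mode == "unigram":
        return list(tokens)
    if mode == "bigram":
        return bigrams(tokens)
    if mode == "both":
        return list(tokens) + bigrams(tokens)
    raise ValueError(f"unknown mode: {mode!r}")


def gini(scores: Sequence[float]) -> float:
    """Gini coefficient of non-negative scores (0 = perfectly equal).

    Assumption: scores are per-document retrievability values, and a
    higher coefficient is read as higher retrieval bias.
    """
    ordered = sorted(scores)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * total)
```

For example, `index_terms("Retrieval bias analysis", "bigram")` yields the two terms `retrieval_bias` and `bias_analysis`, while the `"both"` mode appends them to the three unigrams. Comparing `gini` over retrievability scores computed under each mode is one way to contrast the bias of the resulting indexes.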
Index Terms
- Analyzing the Influence of Bigrams on Retrieval Bias and Effectiveness