DOI: 10.1145/3409256.3409831
short-paper

Analyzing the Influence of Bigrams on Retrieval Bias and Effectiveness

Published: 14 September 2020

ABSTRACT

Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid the foundations for investigating the relationship between retrieval effectiveness and retrieval bias. While various factors influencing bias have been examined, no work has examined the impact of using bigrams within the index on retrieval bias. Intuitively, how documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the bias of a system changes depending on whether documents are represented using unigrams, bigrams, or both. Our analysis of three different retrieval models on three TREC collections shows that a bigram-only representation results in the lowest bias compared to a unigram-only representation, but at the expense of retrieval effectiveness. However, combining both representations reduces the overall bias and increases effectiveness. These findings suggest that, when configuring and indexing a collection, the bag-of-words approach (unigrams) should be augmented with bigrams to create better and fairer retrieval systems.
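
To make the three document representations concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `index_terms` and `gini`, the mode names, the whitespace tokenisation, and the underscore-joined bigram format are all hypothetical choices made for illustration. The abstract does not state how bias is quantified; the Gini coefficient over per-document retrievability scores is used here only as an assumed, commonly used summary of how unevenly retrievability is spread across a collection.

```python
from typing import List, Sequence


def index_terms(text: str, mode: str = "both") -> List[str]:
    """Return the terms a document would be indexed under.

    mode is one of "unigram", "bigram", or "both" (hypothetical option names).
    """
    # Naive lowercase whitespace tokenisation, for illustration only.
    tokens = text.lower().split()
    # Adjacent token pairs, joined with "_" so they cannot collide with unigrams.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    if mode == "unigram":
        return list(tokens)
    if mode == "bigram":
        return bigrams
    if mode == "both":
        return list(tokens) + bigrams
    raise ValueError(f"unknown mode: {mode}")


def gini(scores: Sequence[float]) -> float:
    """Gini coefficient of non-negative scores.

    0.0 means every document has the same score; values near 1.0 mean
    a few documents hold almost all of the total score.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Formula on ascending-sorted values: sum((2i - n - 1) * x_i) / (n * sum(x)).
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


if __name__ == "__main__":
    doc = "Retrieval bias and effectiveness"
    for mode in ("unigram", "bigram", "both"):
        print(mode, index_terms(doc, mode))
    # Made-up retrievability counts for four documents under two configurations.
    print(gini([5, 5, 5, 5]), gini([0, 0, 0, 20]))
```

In an experiment of the kind the abstract describes, one would build an index per representation, issue a large query set against each, count how often each document is retrieved, and compare the resulting bias summaries alongside an effectiveness measure.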

    • Published in

      ICTIR '20: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval
      September 2020
      207 pages
      ISBN:9781450380676
      DOI:10.1145/3409256

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 209 of 482 submissions, 43%
