ABSTRACT
Document ranking experiments should be repeatable. However, the interaction between multi-threaded indexing and score ties during retrieval may yield non-deterministic rankings, making repeatability less trivial than one might imagine. In the open-source Lucene search engine, score ties are broken by internal document ids, which are assigned at index time. Multi-threaded indexing makes experimentation with large modern document collections practical, but it also means that internal document ids are not assigned consistently across different index instances of the same collection, and thus score ties are broken unpredictably. This short paper examines the effectiveness impact of such score ties, quantifying the variability that can be attributed to this phenomenon. The obvious way to remove this non-determinism and ensure repeatable document ranking is to break score ties using external collection document ids. This approach, however, comes with measurable efficiency costs due to the necessity of consulting external identifiers during query evaluation.
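The tie-breaking behavior described above can be illustrated with a minimal sketch. This is not Lucene code: the document ids, scores, and internal id assignments are made-up values, and the two function names are hypothetical. It only shows why ordering ties by internal id can change the top-k between two index builds, while ordering ties by external id cannot.

```python
from typing import Dict, List

# Hypothetical retrieval scores for one query; three documents tie at 2.5.
SCORES: Dict[str, float] = {
    "doc-A": 3.1,
    "doc-B": 2.5,
    "doc-C": 2.5,
    "doc-D": 2.5,
    "doc-E": 1.0,
}

# Two index builds of the same collection. The internal ids are assumed
# values standing in for what multi-threaded indexing might assign.
INDEX_1: Dict[str, int] = {"doc-A": 0, "doc-B": 1, "doc-C": 2, "doc-D": 3, "doc-E": 4}
INDEX_2: Dict[str, int] = {"doc-A": 2, "doc-B": 4, "doc-C": 0, "doc-D": 1, "doc-E": 3}


def rank_by_internal_id(
    scores: Dict[str, float], internal_ids: Dict[str, int], k: int
) -> List[str]:
    """Sort by descending score, breaking ties by internal id."""
    ordered = sorted(scores, key=lambda d: (-scores[d], internal_ids[d]))
    return ordered[:k]


def rank_by_external_id(scores: Dict[str, float], k: int) -> List[str]:
    """Sort by descending score, breaking ties by external document id."""
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:k]


if __name__ == "__main__":
    # The top-3 differs between the two builds when ties use internal ids.
    print(rank_by_internal_id(SCORES, INDEX_1, 3))
    print(rank_by_internal_id(SCORES, INDEX_2, 3))
    # The top-3 does not depend on the build when ties use external ids.
    print(rank_by_external_id(SCORES, 3))
```

In this toy setting the external-id ordering needs no index-specific information at all, which is what makes it repeatable. In a real engine, the external id has to be looked up for each tied candidate during query evaluation, which is the source of the efficiency cost the abstract mentions.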
Index Terms
- The Impact of Score Ties on Repeatability in Document Ranking