ABSTRACT
Inferring the score distributions of relevant and non-relevant documents is an essential task for many IR applications (e.g., information filtering, recall-oriented IR, meta-search, distributed IR). Accurate modeling of score distributions is the basis of any such inference, and numerous score distribution models have therefore been proposed in the literature. Most of these models were proposed on the basis of empirical evidence and goodness-of-fit. In this work, we model score distributions in a different, systematic manner. We start with a basic assumption on the distribution of terms in a document. Following the transformations applied to term frequencies by two basic ranking functions, BM25 and Language Models, we derive the distribution of the produced scores for all documents. We then focus on the relevant documents and detach our analysis from particular ranking functions. Instead, we consider a model for precision-recall curves and present a general mathematical framework that, given any score distribution for all retrieved documents, produces an analytical formula for the score distribution of relevant documents consistent with precision-recall curves that follow this model. In particular, assuming a Gamma distribution for all retrieved documents, we show that the derived distribution for the relevant documents resembles a Gaussian distribution with a heavy right tail.
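The link between a precision-recall model and the score distribution of relevant documents can be sketched with Bayes' rule alone. The block below is an illustration under assumed notation and is not the paper's own derivation. The symbols F, F_R, f, f_R, λ, g, and h are introduced here for the sketch, and the precision-recall model g is left generic because the abstract does not state its form.

```latex
% Assumed notation (not from the source):
%   S       score of a retrieved document
%   F(s)    = \Pr(S > s), survival function over all retrieved documents
%   F_R(s)  = \Pr(S > s \mid \mathrm{rel}), survival function over relevant documents
%   f, f_R  the corresponding densities, f = -F', f_R = -F_R'
%   \lambda = \Pr(\mathrm{rel}), fraction of retrieved documents that are relevant
%   g       assumed precision-recall model, p = g(r)

% Recall and precision when every document scoring above s is returned:
\begin{align}
  r(s) &= F_R(s), \\
  p(s) &= \Pr(\mathrm{rel} \mid S > s) = \frac{\lambda\, F_R(s)}{F(s)}.
\end{align}

% Imposing the precision-recall model p(s) = g(r(s)) ties F_R to F:
\begin{equation}
  F(s) = h\bigl(F_R(s)\bigr), \qquad h(r) = \frac{\lambda\, r}{g(r)}.
\end{equation}

% If h is invertible and differentiable, differentiating in s gives
\begin{equation}
  F_R(s) = h^{-1}\bigl(F(s)\bigr), \qquad
  f_R(s) = \frac{f(s)}{h'\bigl(F_R(s)\bigr)}.
\end{equation}

% Consistency check: as s \to -\infty, F = F_R = 1, so h(1) = 1,
% which requires g(1) = \lambda (precision at full recall equals \lambda).
```

Under these assumptions, substituting a Gamma survival function for F and a concrete model for g would yield a closed or numerically invertible form for f_R. Whether h is invertible depends on the chosen g and has to be checked case by case.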
Index Terms
- Score distribution models: assumptions, intuition, and robustness to score manipulation