ABSTRACT
There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Spärck Jones's probabilistic model are based on strong or complex assumptions. We show that a more intuitively plausible assumption suffices. Moreover, the new assumption, while conceptually very simple, provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).
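As background for the quantities the abstract refers to, the following minimal sketch computes the classic IDF, log(N / n_t), and the Robertson-Spärck Jones term weight, log[p(1 - q) / (q(1 - p))], where p and q are the probabilities that a term occurs in a relevant and a non-relevant document respectively. It also shows the well-known simplification from Croft and Harper (1979), which holds p constant; it does not implement the new assumption proposed in this paper. The function names, the choice p = 0.5, and the estimate q ≈ n_t / N (which assumes relevant documents are a negligible fraction of the collection) are illustrative assumptions, not taken from the abstract.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Classic inverse document frequency: log(N / n_t)."""
    return math.log(n_docs / doc_freq)


def rsj_weight(p: float, q: float) -> float:
    """Robertson-Spärck Jones term weight: log of the odds ratio of p to q.

    p: probability the term occurs in a relevant document
    q: probability the term occurs in a non-relevant document
    """
    return math.log(p * (1 - q) / (q * (1 - p)))


def croft_harper_weight(n_docs: int, doc_freq: int, p: float = 0.5) -> float:
    """RSJ weight with p held constant (assumed 0.5 here) and q ~ n_t / N.

    The estimate of q is an assumption: it treats nearly all documents
    as non-relevant. With p = 0.5 the weight reduces to log((N - n_t) / n_t).
    """
    q = doc_freq / n_docs
    return rsj_weight(p, q)


if __name__ == "__main__":
    # Hypothetical collection: 1000 documents, term appears in 10 of them.
    print(idf(1000, 10))
    print(croft_harper_weight(1000, 10))
```

For rare terms (n_t much smaller than N) the two values are close, which is why the constant-p simplification is often described as yielding an IDF-like weight.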
References
- K. W. Church and W. A. Gale. Inverse document frequency (IDF): A measure of deviations from Poisson. In Proceedings of the Third Workshop on Very Large Corpora (WVLC), pages 121--130, 1995.
- W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285--295, 1979. Reprinted in Karen Spärck Jones and Peter Willett, eds., Readings in Information Retrieval, Morgan Kaufmann, pp. 339--344, 1997.
- A. P. de Vries and T. Roelleke. Relevance information: A loss of entropy but a gain for idf? In Proceedings of SIGIR, pages 282--289, 2005.
- H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In Proceedings of SIGIR, pages 49--56, 2004.
- W. R. Greiff. A theory of term weighting based on exploratory data analysis. In Proceedings of SIGIR, pages 11--19, New York, NY, USA, 1998.
- D. Harman. The history of IDF and its influences on IR and other fields. In Charting a New Course: Natural Language Processing and Information Retrieval: Essays in Honour of Karen Spärck Jones, pages 69--79. Springer, 2005.
- C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, chapter 11 (Probabilistic information retrieval). Cambridge University Press, 2007. Draft of April 28.
- K. Papineni. Why inverse document frequency? In Proceedings of the NAACL, pages 1--8, 2001.
- S. E. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503--520, 2004.
- S. E. Robertson and K. Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129--146, 1976.
- S. E. Robertson and S. Walker. On relevance weights with little relevance information. In Proceedings of SIGIR, pages 16--24, 1997.
- K. Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11--21, 1972.
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.
- S. K. M. Wong and Y. Y. Yao. A note on inverse document frequency weighting scheme {sic}. Technical Report TR-89-990, Cornell University, Ithaca, NY, USA, 1989.
Index Terms
- IDF revisited: a simple new derivation within the Robertson-Spärck Jones probabilistic model