DOI: 10.1145/1277741.1277891
Article

IDF revisited: a simple new derivation within the Robertson-Spärck Jones probabilistic model

Published: 23 July 2007

ABSTRACT

There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Spärck Jones's probabilistic model are based on strong or complex assumptions. We show that a more intuitively plausible assumption suffices. Moreover, the new assumption, while conceptually very simple, provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).
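
For orientation, the quantity under discussion can be sketched numerically. The abstract does not state any formula, so the two functions below are assumptions taken from the standard textbook forms: the classic IDF, log(N/n), and the weight that the Robertson-Spärck Jones model reduces to when no relevance information is available, log((N - n + 0.5)/(n + 0.5)). The function and parameter names are hypothetical, and this is not the derivation proposed in the article.

```python
import math


def idf_classic(num_docs, doc_freq):
    """Classic IDF: log(N / n), with N documents and n containing the term.

    Assumed textbook form, not taken from the abstract.
    """
    return math.log(num_docs / doc_freq)


def idf_rsj_no_relevance(num_docs, doc_freq):
    """RSJ-style weight with no relevance information (assumed textbook form):
    log((N - n + 0.5) / (n + 0.5)).
    """
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Rarer terms receive larger weights under both forms.
    for n in (10, 100, 600):
        print(n, idf_classic(1000, n), idf_rsj_no_relevance(1000, n))
```

Under these assumed forms the second weight turns negative once a term occurs in more than half of the collection, whereas the classic form never drops below zero.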

References

  1. K. W. Church and W. A. Gale. Inverse document frequency (IDF): A measure of deviations from Poisson. In Proceedings of the Third Workshop on Very Large Corpora (WVLC), pages 121--130, 1995.
  2. W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285--295, 1979. Reprinted in Karen Spärck Jones and Peter Willett, eds., Readings in Information Retrieval, Morgan Kaufmann, pp. 339--344, 1997.
  3. A. P. de Vries and T. Roelleke. Relevance information: A loss of entropy but a gain for idf? In Proceedings of SIGIR, pages 282--289, 2005.
  4. H. Fang, T. Tao, and C. Zhai. A formal study of information retrieval heuristics. In Proceedings of SIGIR, pages 49--56, 2004.
  5. W. R. Greiff. A theory of term weighting based on exploratory data analysis. In Proceedings of SIGIR, pages 11--19, New York, NY, USA, 1998.
  6. D. Harman. The history of IDF and its influences on IR and other fields. In Charting a New Course: Natural Language Processing and Information Retrieval: Essays in Honour of Karen Spärck Jones, pages 69--79. Springer, 2005.
  7. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, chapter 11 (Probabilistic information retrieval). Cambridge University Press, 2007. Draft of April 28.
  8. K. Papineni. Why inverse document frequency? In Proceedings of the NAACL, pages 1--8, 1995.
  9. S. E. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503--520, 2004.
  10. S. E. Robertson and K. Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129--146, 1976.
  11. S. E. Robertson and S. Walker. On relevance weights with little relevance information. In Proceedings of SIGIR, pages 16--24, 1997.
  12. K. Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11--21, 1972.
  13. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.
  14. S. K. M. Wong and Y. Y. Yao. A note on inverse document frequency weighting scheme {sic}. Technical Report TR-89-990, Cornell University, Ithaca, NY, USA, 1989.

    • Published in

      SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
      July 2007
      946 pages
      ISBN: 9781595935977
      DOI: 10.1145/1277741

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
