DOI: 10.1145/1102351.1102420

Article

Modeling word burstiness using the Dirichlet distribution

Published: 07 August 2005

ABSTRACT

Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
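The burstiness effect the abstract describes can be made concrete with a small numeric sketch. The DCM (also known as the Polya distribution) integrates the multinomial's parameter over a Dirichlet prior, which rewards repeated occurrences of the same word. The sketch below is illustrative only and is not the paper's implementation: the two-word vocabulary, the count vectors, and the alpha values are invented for the example.

```python
from math import lgamma, log

def log_dcm(counts, alpha):
    """Log-probability of a count vector under the Dirichlet
    compound multinomial (Polya) distribution with parameters alpha."""
    n = sum(counts)
    A = sum(alpha)
    # multinomial coefficient: log n! - sum_w log x_w!
    ll = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Dirichlet-multinomial normalizer and per-word Gamma terms
    ll += lgamma(A) - lgamma(n + A)
    ll += sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return ll

def log_multinomial(counts, theta):
    """Log-probability of the same counts under a plain multinomial."""
    n = sum(counts)
    ll = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    ll += sum(x * log(t) for x, t in zip(counts, theta) if x > 0)
    return ll

# Two length-4 documents over a 2-word vocabulary with the same
# expected word frequencies: one bursty, one balanced (invented data).
bursty, balanced = [4, 0], [2, 2]
theta = [0.5, 0.5]   # multinomial parameters
alpha = [0.5, 0.5]   # small alphas -> strong burstiness

# The multinomial prefers the balanced document...
assert log_multinomial(balanced, theta) > log_multinomial(bursty, theta)
# ...while the DCM with small alpha favors the bursty one.
assert log_dcm(bursty, alpha) > log_dcm(balanced, alpha)
```

The extra degree of freedom is the overall scale of alpha: shrinking all alphas toward zero increases burstiness (probability mass concentrates on documents that repeat a few words), while letting them grow recovers multinomial-like behavior.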


Published in

ICML '05: Proceedings of the 22nd international conference on Machine learning
August 2005
1113 pages
ISBN: 1595931805
DOI: 10.1145/1102351
Copyright © 2005 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 140 of 548 submissions, 26%
