ABSTRACT
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
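The abstract does not spell out the DCM formula, so the following is a minimal sketch, not the authors' implementation, of how the Dirichlet compound multinomial (also known as the multivariate Pólya distribution) scores a document's word-count vector compared with a plain multinomial. The vocabulary size, the alpha values, and the example count vectors are illustrative assumptions chosen only to make the burstiness effect visible.

```python
import numpy as np
from scipy.special import gammaln

def dcm_log_prob(x, alpha):
    """Log-probability of count vector x under the Dirichlet compound
    multinomial: n!/prod(x_w!) * Gamma(s)/Gamma(n+s) *
    prod_w Gamma(x_w + alpha_w)/Gamma(alpha_w),
    where n = sum_w x_w and s = sum_w alpha_w."""
    x = np.asarray(x, dtype=float)
    n, s = x.sum(), alpha.sum()
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()
    return (log_coef
            + gammaln(s) - gammaln(n + s)
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

def multinomial_log_prob(x, theta):
    """Log-probability of count vector x under a multinomial with
    word-emission probabilities theta."""
    x = np.asarray(x, dtype=float)
    log_coef = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
    return log_coef + (x * np.log(theta)).sum()

# Illustrative 3-word vocabulary; small alpha values make counts "bursty".
alpha = np.array([0.1, 0.1, 0.1])
theta = alpha / alpha.sum()          # multinomial with the same mean
bursty = np.array([4, 0, 0])         # one word repeated four times
spread = np.array([2, 1, 1])         # same length, counts spread out
for x in (bursty, spread):
    print(x, dcm_log_prob(x, alpha), multinomial_log_prob(x, theta))
```

With these illustrative values, the DCM assigns higher probability to the bursty count vector than to the spread-out one, while the multinomial with the same mean word probabilities does the opposite; the extra degree of freedom is the overall magnitude of alpha, and letting it grow large (with the mean fixed) recovers multinomial-like behavior.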