ABSTRACT
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
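The abstract does not spell out the DCM formula, so the following is a minimal sketch, not the authors' implementation, of how the Dirichlet compound multinomial (also known as the multivariate Pólya distribution) scores a document's word-count vector compared with a plain multinomial. The vocabulary size, the alpha values, and the example count vectors are illustrative assumptions chosen only to make the burstiness effect visible.

```python
import numpy as np
from scipy.special import gammaln

def dcm_log_prob(x, alpha):
    """Log-probability of count vector x under the Dirichlet compound
    multinomial: n!/prod(x_w!) * Gamma(s)/Gamma(n+s) *
    prod_w Gamma(x_w + alpha_w)/Gamma(alpha_w),
    where n = sum_w x_w and s = sum_w alpha_w."""
    x = np.asarray(x, dtype=float)
    n, s = x.sum(), alpha.sum()
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()
    return (log_coef
            + gammaln(s) - gammaln(n + s)
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

def multinomial_log_prob(x, theta):
    """Log-probability of count vector x under a multinomial with
    word-emission probabilities theta."""
    x = np.asarray(x, dtype=float)
    log_coef = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
    return log_coef + (x * np.log(theta)).sum()

# Illustrative 3-word vocabulary; small alpha values make counts "bursty".
alpha = np.array([0.1, 0.1, 0.1])
theta = alpha / alpha.sum()          # multinomial with the same mean
bursty = np.array([4, 0, 0])         # one word repeated four times
spread = np.array([2, 1, 1])         # same length, counts spread out
for x in (bursty, spread):
    print(x, dcm_log_prob(x, alpha), multinomial_log_prob(x, theta))
```

With these illustrative values, the DCM assigns higher probability to the bursty count vector than to the spread-out one, while the multinomial with the same mean word probabilities does the opposite; the extra degree of freedom is the overall magnitude of alpha, and letting it grow large (with the mean fixed) recovers multinomial-like behavior.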