Techniques for Improving the Performance of Naive Bayes for Text Classification

Schneider, Karl-Michael

doi:10.1007/978-3-540-30586-6_76

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness. However, its performance is often degraded because it does not model text well, because features are selected inappropriately, and because it lacks reliable confidence scores. We address these problems and show that they can be solved by some simple corrections. We demonstrate that these modifications significantly improve the performance of Naive Bayes for text classification.

Author information

Authors and Affiliations

Department of General Linguistics, University of Passau, Innstr. 40, 94032, Passau, Germany
Karl-Michael Schneider

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

Cite this paper

Schneider, KM. (2005). Techniques for Improving the Performance of Naive Bayes for Text Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_76

DOI: https://doi.org/10.1007/978-3-540-30586-6_76
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)
