Abstract
This paper investigates two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. Having identified the weaknesses and strengths of each technique, we propose a new classifier, the kNN model-based classifier (kNN Model), which combines the strengths of both kNN and Rocchio. A text categorization prototype that implements kNN Model along with kNN and Rocchio is described. The methods are evaluated experimentally on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the proposed kNN model-based method outperforms the kNN and Rocchio classifiers and is therefore a good alternative to kNN and Rocchio in some application areas.
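For readers unfamiliar with the two baseline methods named in the abstract, the following is a minimal sketch of each. It is not the kNN Model proposed in the paper, whose details the abstract does not give. The toy term-weight vectors, the labels, and all function names are hypothetical. Cosine similarity is assumed as the similarity measure. The Rocchio variant is simplified to one centroid per class built from positive examples only, with no weighting of negative examples.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """kNN: majority vote among the k training documents most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def rocchio_fit(train):
    """Simplified Rocchio: one prototype (mean vector) per class."""
    sums = {}
    counts = Counter()
    for vec, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def rocchio_predict(centroids, query):
    """Assign the class whose prototype is most similar to query."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))


# Hypothetical toy data: three term weights per document.
TRAIN = [
    ([3.0, 0.0, 1.0], "sport"),
    ([2.0, 1.0, 0.0], "sport"),
    ([0.0, 3.0, 1.0], "finance"),
    ([1.0, 2.0, 0.0], "finance"),
]

if __name__ == "__main__":
    query = [0.0, 1.0, 0.0]
    print(knn_predict(TRAIN, query, k=3))
    print(rocchio_predict(rocchio_fit(TRAIN), query))
```

The sketch shows the trade-off the abstract alludes to: kNN keeps every training document and compares the query against all of them at classification time, while Rocchio compresses each class into a single prototype, which is cheaper to query but loses the internal structure of the class.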
Additional information
This work was partly supported by the European Commission project ICONS, project no. IST-2001-32429.
About this article
Cite this article
Guo, G., Wang, H., Bell, D. et al. Using kNN model for automatic text categorization. Soft Comput 10, 423–430 (2006). https://doi.org/10.1007/s00500-005-0503-y