ABSTRACT
This paper explores incorporating topical support documents into a training set to compensate for a shortage of positive training data in text categorization. To support topical representation, our method applies a simple transformation to documents: it creates new documents from existing positive documents by squaring a conventional term weight. The topical support documents thus created are expected not only to preserve the topic but also to improve its representation by emphasizing terms with higher weights. Experiments with support vector machines on the RCV1 collection showed that the method is effective when only a small number of positive training documents are available. In micro-averaged F1, our topical support representation achieved improvements of 52.01% for 33 RCV1 Topic categories with fewer than 100 positive training documents and 8.83% for 56 categories with fewer than 300. A robustness-based analysis of the results indicates that topical support documents contribute a steady and stable improvement.
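The squaring transformation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which term weight is used or whether the squared vector is re-normalized, so the sparse dictionary representation, the L2 re-normalization step, and the function names `l2_normalize`, `make_support_document`, and `augment_training_set` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a sparse term-weight vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def make_support_document(vec):
    """Build a support document by squaring every term weight.

    Re-normalizing afterwards is an assumption; the abstract only
    states that a conventional term weight is squared.
    """
    squared = {term: w * w for term, w in vec.items()}
    return l2_normalize(squared)


def augment_training_set(docs, labels):
    """Append one support document for each positive (label 1) document."""
    new_docs = list(docs)
    new_labels = list(labels)
    for vec, label in zip(docs, labels):
        if label == 1:
            new_docs.append(make_support_document(vec))
            new_labels.append(1)
    return new_docs, new_labels


# Toy positive document with hypothetical raw weights.
positive = l2_normalize({"bank": 3.0, "loan": 2.0, "the": 1.0})
negative = l2_normalize({"match": 2.0, "the": 1.0})
support = make_support_document(positive)
train_docs, train_labels = augment_training_set([positive, negative], [1, 0])
```

In this toy case the highest-weighted term gains relative weight after squaring and re-normalization while the lowest-weighted term loses it, and the ranking of terms is unchanged, which matches the abstract's claim that the topic is preserved while higher-weighted terms are emphasized. The augmented set could then be passed to any SVM trainer.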
Index Terms
- Incorporating topical support documents into a small training set in text categorization