Copyright © 2002 Elsevier Ltd. All rights reserved.
Improving text categorization using the importance of sentences
Received 18 July 2002;
Abstract
Automatic text categorization is the problem of assigning text documents to pre-defined categories. To classify text documents, we must extract useful features. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Because the sentences in a document differ in importance, features from more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. We then represent a document as a vector of features whose weights depend on the importance of each sentence. To verify our new method, we conduct experiments using newsgroup data sets in two languages: one written in English and the other written in Korean. Four kinds of classifiers are used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observe that our new method yields a significant improvement for all of these classifiers on both data sets.
Author Keywords: Text categorization; Importance of sentence; Text summarization techniques; Indexing technique; Text classifier
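The indexing idea summarized in the abstract, weighting each feature by the importance of the sentence it occurs in, can be illustrated with a minimal Python sketch. This is not the paper's actual formulation: the importance score used here (the fraction of title terms that a sentence contains) and the names `tokenize`, `sentence_importance`, and `weighted_term_frequencies` are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_importance(sentence_tokens, title_tokens):
    """Hypothetical importance score: fraction of title terms in the sentence.

    This stands in for a summarization-based score; it is an assumption,
    not the measure defined in the paper.
    """
    title = set(title_tokens)
    if not title:
        return 0.0
    return len(title & set(sentence_tokens)) / len(title)


def weighted_term_frequencies(title, sentences):
    """Accumulate term counts, scaling each occurrence by (1 + importance)."""
    title_tokens = tokenize(title)
    weights = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Every occurrence counts at least once; important sentences count more.
        scale = 1.0 + sentence_importance(tokens, title_tokens)
        for token in tokens:
            weights[token] += scale
    return dict(weights)


doc_weights = weighted_term_frequencies(
    "text categorization",
    ["Text categorization assigns labels.", "The weather is nice."],
)
```

In this sketch, terms from the first sentence receive twice the weight of terms from the second, because the first sentence contains every title term. The resulting weights could replace raw term frequencies before any inverse document frequency scaling is applied.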
Article Outline
- 1. Introduction
- 2. The proposed text categorization system
- 3. Other classifiers used in our experiments
- 4. Empirical evaluation
- 4.1. Data sets and experimental settings
- 4.2. Experimental results
- 4.2.1. Setting the number of features
- 4.2.2. Setting the constant weights k1 and k2
- 4.2.3. Results in two newsgroup data sets
- 4.3. Verifying our indexing method
- 5. Conclusions
- Acknowledgements
- References














