Information Processing & Management
Volume 40, Issue 1, January 2004, Pages 65-79
 
doi:10.1016/S0306-4573(02)00056-0
Copyright © 2002 Elsevier Ltd. All rights reserved.

Improving text categorization using the importance of sentences

Youngjoong Ko (corresponding author), Jinwoo Park and Jungyun Seo

Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea

Received 18 July 2002; accepted 18 July 2002; available online 19 December 2002.


Abstract

Automatic text categorization is the problem of assigning text documents to pre-defined categories. To classify text documents, we must extract useful features. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Since sentences in a document differ in importance, features from more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. We then represent a document as a vector of features whose weights depend on the importance of each sentence. To verify our new method, we conduct experiments using newsgroup data sets in two languages: one written in English and the other in Korean. Four kinds of classifiers are used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observe that our new method yields a significant improvement for all of these classifiers on both data sets.
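
The idea of weighting features by sentence importance can be illustrated with a small sketch. It is not the authors' method: the abstract does not give their formulas, so the code below assumes one plausible stand-in, where a sentence's importance is 1 plus its cosine similarity to the document title and every term occurrence contributes that importance instead of a plain count of 1. The function names (`tokenize`, `cosine`, `sentence_weighted_tf`) and the `base` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_weighted_tf(title, sentences, base=1.0):
    """Importance-weighted term frequencies for one document.

    Assumption (not from the paper): the importance of a sentence is
    base + cosine(sentence, title), and each term occurrence in that
    sentence adds this importance to the term's weight.
    """
    title_vec = Counter(tokenize(title))
    weights = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        importance = base + cosine(Counter(tokens), title_vec)
        for tok in tokens:
            weights[tok] += importance
    return weights


if __name__ == "__main__":
    doc = sentence_weighted_tf(
        "graphics cards",
        ["New graphics cards arrived", "The weather was nice"],
    )
    # Terms from the title-related sentence receive a larger weight.
    print(doc["graphics"], doc["weather"])
```

With this weighting, a term that appears in a sentence resembling the title ends up with a larger value than a term from an unrelated sentence, and the resulting vector could be fed to any of the classifiers named in the abstract in place of raw term frequencies.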

Author Keywords: Text categorization; Importance of sentence; Text summarization techniques; Indexing technique; Text classifier

Article Outline

1. Introduction
2. The proposed text categorization system
2.1. Pre-processing
2.2. Feature selection
2.3. Measuring the importance of sentences
2.3.1. The importance of sentences by the title
2.3.2. The importance of sentences by the importance of terms
2.3.3. The combination of two sentence importance values
2.3.4. The indexing process
3. Other classifiers used in our experiments
3.1. Naive Bayes
3.2. Rocchio
3.3. k-Nearest Neighbor (k-NN) algorithm
3.4. Support vector machines
4. Empirical evaluation
4.1. Data sets and experimental settings
4.2. Experimental results
4.2.1. Setting the number of features
4.2.2. Setting the constant weights k1 and k2
4.2.3. Results in two newsgroup data sets
4.3. Verifying our indexing method
5. Conclusions
Acknowledgements
References






 