Information Processing & Management
Volume 40, Issue 3, May 2004, Pages 421-439
 
doi:10.1016/j.ipm.2003.09.003
Copyright © 2003 Elsevier Ltd. All rights reserved.

Co-trained support vector machines for large scale unstructured document classification using unlabeled data and syntactic information

Seong-Bae Park and Byoung-Tak Zhang (corresponding author)

School of Computer Science and Engineering, Seoul National University, 151-744, Seoul, South Korea

Received 3 October 2002; accepted 19 September 2003.
Available online 30 October 2003.


Abstract

Most document classification systems consider only the distribution of content words in documents and ignore the underlying syntactic information, although it is also an important factor. In this paper, we present an approach for classifying large-scale unstructured documents that incorporates both the lexical and the syntactic information of documents. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small number of labeled examples is augmented by a large number of unlabeled examples. Since the lexical and the syntactic information can serve as two separate views of unstructured documents, the co-training algorithm improves document classification performance by using both views together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and the TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both the lexical and the syntactic information.
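
The co-training idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the selection rule, the number of rounds, or how the two classifiers are combined, so `co_train`, `train_pair`, `predict_combined`, `rounds`, and `per_round` are hypothetical names and choices. To keep the example dependency-free, a nearest-centroid classifier stands in for the support vector machines used in the paper, and the two "views" are toy two-dimensional vectors rather than real lexical and syntactic features.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Views = Tuple[Vector, Vector]  # (lexical view, syntactic view) of one document


class CentroidClassifier:
    """Stand-in base learner (NOT an SVM): nearest class centroid, labels +1/-1."""

    def fit(self, xs: List[Vector], ys: List[int]) -> "CentroidClassifier":
        # Assumes the training data contains at least one example of each class.
        self.centroids = {}
        for label in (1, -1):
            rows = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def decision(self, x: Vector) -> float:
        """Signed score: positive means x is closer to the +1 centroid."""
        def sqdist(c: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return sqdist(self.centroids[-1]) - sqdist(self.centroids[1])

    def predict(self, x: Vector) -> int:
        return 1 if self.decision(x) >= 0 else -1


def train_pair(labeled: List[Tuple[Views, int]]):
    """Train one classifier per view on the current labeled set."""
    ys = [y for _, y in labeled]
    clf_a = CentroidClassifier().fit([views[0] for views, _ in labeled], ys)
    clf_b = CentroidClassifier().fit([views[1] for views, _ in labeled], ys)
    return clf_a, clf_b


def co_train(labeled, unlabeled, rounds=2, per_round=1):
    """Grow the labeled set with each classifier's most confident predictions."""
    labeled, pool = list(labeled), list(unlabeled)
    clf_a, clf_b = train_pair(labeled)
    for _ in range(rounds):
        if not pool:
            break
        for view_index, clf in ((0, clf_a), (1, clf_b)):
            # Rank the remaining pool by this view's confidence (|score|).
            pool.sort(key=lambda v: abs(clf.decision(v[view_index])), reverse=True)
            picked, pool = pool[:per_round], pool[per_round:]
            # The pseudo-labeled documents become training data for BOTH views.
            labeled.extend((v, clf.predict(v[view_index])) for v in picked)
        clf_a, clf_b = train_pair(labeled)
    return clf_a, clf_b, labeled


def predict_combined(clf_a, clf_b, views: Views) -> int:
    """Assumed combination rule: sum the two views' signed scores."""
    score = clf_a.decision(views[0]) + clf_b.decision(views[1])
    return 1 if score >= 0 else -1


# Toy data: two labeled documents and four unlabeled ones, two views each.
seed = [(([1.0, 0.0], [1.0, 0.0]), 1), (([0.0, 1.0], [0.0, 1.0]), -1)]
unlabeled_docs = [
    ([0.9, 0.1], [0.8, 0.2]),
    ([0.1, 0.9], [0.2, 0.8]),
    ([0.8, 0.3], [0.7, 0.1]),
    ([0.2, 0.7], [0.3, 0.9]),
]
clf_lex, clf_syn, grown = co_train(seed, unlabeled_docs, rounds=2, per_round=1)
```

In each round, every view labels the unlabeled documents it is most confident about, and both classifiers are then retrained on the enlarged set, so information from one view reaches the other through the shared pseudo-labels. Replacing `CentroidClassifier` with an SVM that exposes a signed decision score would bring the sketch closer to the setup the abstract describes.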

Author Keywords: Text categorization; Co-training; Support vector machines; Syntactic information; Text chunking

Article Outline

1. Introduction
2. Related work
3. Co-training algorithm for classifying unstructured documents
3.1. Co-training algorithm
3.2. Two views
3.3. Syntactic features
3.4. Support vector machines as base classifiers
4. Experiments
4.1. Datasets
4.1.1. Reuters-21578 ModApte Split
4.1.2. TREC-7
4.2. Performance measure
4.3. Experimental results
4.3.1. Effect of syntactic information
4.3.1.1. Reuters-21578
4.3.1.2. TREC-7
4.3.2. Effect of unlabeled documents
4.3.2.1. TREC-7
4.3.2.2. Reuters-21578
4.4. Analysis of the co-training algorithm
5. Conclusion
Acknowledgements
References









 