poster

Incorporating topical support documents into a small training set in text categorization

Author:
Kyung Soon Lee

Chonbuk National University, Jeonju, South Korea


CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management, October 2008, Pages 1511–1512, https://doi.org/10.1145/1458082.1458361

Published: 26 October 2008


ABSTRACT

This paper explores incorporating topical support documents into a training set to compensate for a shortage of positive training data in text categorization. To support topical representation, our method applies a simple transformation: it creates new documents from existing positive documents by squaring their conventional term weights. The topical support documents thus created are expected not only to preserve the topic but also to improve the topical representation by emphasizing terms with higher weights. Experiments with support vector machines showed the method's effectiveness on the RCV1 collection when only a small number of positive training documents is available. In micro-averaged F1, our topical support representation achieved improvements of 52.01% for the 33 RCV1 Topic categories with fewer than 100 positive training documents and 8.83% for the 56 categories with fewer than 300. A robustness-based analysis of the results indicates that topical support documents contribute a steady and stable improvement.
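The transformation described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract only says that a conventional term weight is squared, so the sparse dictionary representation, the L2 renormalization after squaring, the choice of one support document per positive document, and the function names `l2_normalize`, `make_support_document`, and `augment_training_set` are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a sparse term-weight vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def make_support_document(vec):
    """Build a support document by squaring every term weight.

    Renormalizing afterwards is an assumption of this sketch; the
    abstract does not say whether the squared weights are rescaled.
    """
    squared = {term: w * w for term, w in vec.items()}
    return l2_normalize(squared)


def augment_training_set(docs, labels):
    """Append one support document for each positive (label 1) example.

    Negative examples are left untouched.
    """
    new_docs, new_labels = list(docs), list(labels)
    for vec, label in zip(docs, labels):
        if label == 1:
            new_docs.append(make_support_document(vec))
            new_labels.append(1)
    return new_docs, new_labels


# Toy data with hypothetical weights: one positive and one negative document.
positive = l2_normalize({"merger": 3.0, "shares": 2.0, "said": 1.0})
negative = l2_normalize({"weather": 2.0, "said": 1.0})

docs, labels = augment_training_set([positive, negative], [1, 0])
support = docs[-1]
print(labels)
print({term: round(w, 3) for term, w in support.items()})
```

In this toy case the support document keeps the same terms as its source, but after renormalization the highest-weighted term gains relative weight and the lowest-weighted term loses it, which matches the abstract's point about emphasizing terms with higher weights. The augmented set could then be passed to any SVM trainer.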


Index Terms

Incorporating topical support documents into a small training set in text categorization
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction


Published in
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN: 9781595939913
DOI: 10.1145/1458082
General Chair:
James G. Shanahan, Church and Duncan Group Inc, USA
Program Chairs:
Sihem Amer-Yahia, Yahoo! Research, USA
Ioana Manolescu, INRIA, France
Yi Zhang, University of California, Santa Cruz, USA
David A. Evans, JustSystems Evans Research, USA
Alek Kolcz, Microsoft Live Labs, USA
Key-Sun Choi, KAIST, Korea
Abdur Chowdury, Twitter, USA
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
support vector machine
text categorization
topical support representation
transformation
Qualifiers
- poster
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%