Abstract
When dealing with large collections of text, an important need is to filter out irrelevant information and organize the relevant information into meaningful categories. However, developing text classifiers often requires a large number of labeled documents as training examples. Manually labeling documents is costly and time-consuming. More importantly, it is often unrealistic to know beforehand all the categories covered by the documents. Recently, a few methods have been proposed to label documents using only a small set of relevant keywords for each category, an approach known as dataless text classification. In this article, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Given a collection of unlabeled documents and, for each specified category, a small set of seed words relevant to the semantic meaning of that category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic inference. DFC models two kinds of topics: category-topics and general-topics. Category-topics are further divided into relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning. The irrelevant-topics represent the semantics of the unknown categories covered by the document collection, and the general-topics capture global semantic information. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A key novelty of the model is that DFC learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then filtered, or classified, based on its posterior category-topic assignment.
Experiments on two widely used datasets show that DFC consistently outperforms state-of-the-art dataless text classifiers, both for classification with filtering and for classification without filtering. In many tasks, DFC also achieves comparable or even better classification accuracy than state-of-the-art supervised learning solutions. Our experimental results further show that DFC is insensitive to its tuning parameters. Moreover, we conduct a thorough study of the impact of seed words on existing dataless text classification techniques. The results reveal that it is not the number of seed words but their document coverage within the corresponding category that affects dataless classification performance.
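To make the seed-guided setting concrete, the sketch below shows a deliberately simplified keyword-scoring baseline, not DFC itself: each document is scored against hypothetical seed-word sets per category, assigned to the best-matching category, or filtered out when no seed evidence is found. DFC replaces this heuristic with posterior category-topic inference; the seed words and threshold here are illustrative assumptions.

```python
from collections import Counter

# Hypothetical seed words for two illustrative categories. DFC's actual
# mechanism is a seed-guided topic model; this keyword heuristic only
# illustrates the input/output contract of dataless filtering + classification.
SEEDS = {
    "sports": {"game", "team", "score"},
    "politics": {"election", "vote", "policy"},
}

def classify(doc_tokens, seeds=SEEDS, threshold=1):
    """Assign the document to the category whose seed words it mentions most;
    return None (i.e., filter the document out) when no category reaches
    the seed-occurrence threshold."""
    counts = Counter()
    for cat, words in seeds.items():
        counts[cat] = sum(1 for t in doc_tokens if t in words)
    best_cat, best_count = counts.most_common(1)[0]
    return best_cat if best_count >= threshold else None

print(classify("the team won the game with a late score".split()))  # sports
print(classify("stock markets rallied on tuesday".split()))         # None (filtered)
```

Unlike this baseline, DFC propagates the seed signal to non-seed words via word co-occurrence, so documents that never mention a seed word can still be classified correctly.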
Index Terms
- Seed-Guided Topic Model for Document Filtering and Classification