Abstract
When dealing with large collections of text, an important need is to filter out irrelevant information and organize the relevant information into meaningful categories. However, developing text classifiers often requires a large number of labeled documents as training examples. Manually labeling documents is costly and time-consuming. More importantly, it is often unrealistic to know beforehand all the categories covered by the documents. Recently, a few methods have been proposed to label documents using only a small set of relevant keywords for each category, an approach known as dataless text classification. In this article, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Given a collection of unlabeled documents and, for each specified category, a small set of seed words relevant to the semantic meaning of that category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic inference. DFC models two kinds of topics: category-topics and general-topics. Category-topics are further divided into relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning. The irrelevant-topics represent the semantics of the unknown categories covered by the document collection, and the general-topics capture global semantic information. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A key novelty of the model is that DFC learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then filtered, or classified, based on its posterior category-topic assignment.
Experiments on two widely used datasets show that DFC consistently outperforms state-of-the-art dataless text classifiers, both for classification with filtering and for classification without filtering. In many tasks, DFC also achieves comparable or even better classification accuracy than state-of-the-art supervised learning solutions. Our experimental results further show that DFC is insensitive to its tuning parameters. Moreover, we conduct a thorough study of the impact of seed words on existing dataless text classification techniques. The results reveal that it is not the number of seed words but their document coverage within the corresponding category that affects dataless classification performance.
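To make the seed-guided setting concrete, the sketch below shows a deliberately simplified keyword-scoring baseline, not DFC itself: each document is scored against hypothetical seed-word sets per category, assigned to the best-matching category, or filtered out when no seed evidence is found. DFC replaces this heuristic with posterior category-topic inference; the seed words and threshold here are illustrative assumptions.

```python
from collections import Counter

# Hypothetical seed words for two illustrative categories. DFC's actual
# mechanism is a seed-guided topic model; this keyword heuristic only
# illustrates the input/output contract of dataless filtering + classification.
SEEDS = {
    "sports": {"game", "team", "score"},
    "politics": {"election", "vote", "policy"},
}

def classify(doc_tokens, seeds=SEEDS, threshold=1):
    """Assign the document to the category whose seed words it mentions most;
    return None (i.e., filter the document out) when no category reaches
    the seed-occurrence threshold."""
    counts = Counter()
    for cat, words in seeds.items():
        counts[cat] = sum(1 for t in doc_tokens if t in words)
    best_cat, best_count = counts.most_common(1)[0]
    return best_cat if best_count >= threshold else None

print(classify("the team won the game with a late score".split()))  # sports
print(classify("stock markets rallied on tuesday".split()))         # None (filtered)
```

Unlike this baseline, DFC propagates the seed signal to non-seed words via word co-occurrence, so documents that never mention a seed word can still be classified correctly.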
Index Terms
- Seed-Guided Topic Model for Document Filtering and Classification