
Seed-Guided Topic Model for Document Filtering and Classification

Published: 06 December 2018

Abstract

An important need in managing large document collections is to filter out irrelevant information and organize the relevant information into meaningful categories. However, developing text classifiers often requires a large number of labeled documents as training examples, and manually labeling documents is costly and time-consuming. More importantly, it is often unrealistic to know all the categories covered by the documents beforehand. Recently, a few methods have been proposed to label documents by using only a small set of relevant keywords for each category, known as dataless text classification. In this article, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Given a collection of unlabeled documents and, for each specified category, a small set of seed words relevant to the semantic meaning of that category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic inference. DFC models two kinds of topics: category-topics and general-topics. Category-topics are further divided into relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning; the irrelevant-topics represent the semantics of the unknown categories covered by the document collection; and the general-topics capture the global semantic information of the collection. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that DFC learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then filtered, or classified, based on its posterior category-topic assignment.
Experiments on two widely used datasets show that DFC consistently outperforms the state-of-the-art dataless text classifiers, both for classification with filtering and for classification without filtering. In many tasks, DFC achieves comparable or even better classification accuracy than state-of-the-art supervised learning solutions. Our experimental results further show that DFC is insensitive to its tuning parameters. Moreover, we conduct a thorough study of the impact of seed words on existing dataless text classification techniques. The results reveal that dataless classification performance depends not on using more seed words but on how well the seed words cover the documents of the corresponding category.
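To make the seed-guided filtering-and-classification setting concrete, the following is a deliberately simplified toy sketch, not the DFC model itself (DFC is a probabilistic topic model whose posterior category-topic assignments are inferred, e.g., by Gibbs sampling). The function name `classify`, the seed sets, and the `threshold` parameter are all illustrative assumptions: each document is scored per category by the fraction of its tokens that are seed words, assigned to the best-scoring category, and filtered out (returning `None`) when no category scores above the threshold.

```python
# Toy sketch of seed-guided filtering and classification.
# NOTE: this is NOT the DFC model from the article; DFC infers posterior
# category-topic assignments with a seed-guided topic model. This sketch
# only illustrates the input/output contract of the dataless setting.
from collections import Counter


def classify(doc_tokens, seeds, threshold=0.1):
    """Return the best-matching category, or None if the document is filtered out."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    if total == 0:
        return None
    best_cat, best_score = None, 0.0
    for cat, seed_words in seeds.items():
        # Score = fraction of the document's tokens that are seed words
        # of this category (a crude stand-in for topic inference).
        score = sum(counts[w] for w in seed_words) / total
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat if best_score >= threshold else None


seeds = {
    "sports": {"game", "team", "score"},
    "politics": {"election", "vote", "senate"},
}
print(classify("the team won the game with a late score".split(), seeds))
print(classify("a recipe for chocolate cake".split(), seeds))
```

A real dataless model such as DFC goes further than this sketch in exactly the way the abstract describes: it propagates the seed words' semantics to regular words through word co-occurrence, so documents sharing no seed word can still be classified correctly.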



      • Published in

ACM Transactions on Information Systems, Volume 37, Issue 1
        January 2019
        435 pages
        ISSN:1046-8188
        EISSN:1558-2868
        DOI:10.1145/3289475

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 December 2018
        • Revised: 1 June 2018
        • Accepted: 1 June 2018
        • Received: 1 January 2018
Published in TOIS Volume 37, Issue 1


        Qualifiers

        • research-article
        • Research
        • Refereed
