Topic Models Incorporating Statistical Word Senses

Tang, Guoyu; Xia, Yunqing; Sun, Jun; Zhang, Min; Zheng, Thomas Fang

doi:10.1007/978-3-642-54906-9_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8403)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

LDA treats a surface word as identical across all documents and measures its contribution to each topic. However, a surface word may carry different meanings in different contexts: polysemous words are used with different senses depending on where they appear. Intuitively, disambiguating word senses should enhance the discriminative power of topic models. In this work, we propose a joint model that induces document topics and word senses simultaneously. Instead of relying on pre-defined word sense resources, we capture word sense information with a latent variable and induce it directly from the corpora in a fully unsupervised manner. Experimental results show that the proposed joint model significantly outperforms both classic LDA and a standalone sense-based LDA model in document clustering.
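
As a rough illustration of the motivation above (not the authors' model), the following Python sketch shows how a surface-word representation makes two documents share the entry "bank" even though they use it in different senses, and how sense-tagged tokens remove that overlap. The toy documents, the labels `bank#1` and `bank#2`, and the helpers `sense_tag` and `shared_vocab` are all hypothetical; here the senses are assigned by hand, whereas the paper induces them as a latent variable.

```python
from collections import Counter

# Two toy documents that use the surface word "bank" in different senses.
docs = {
    "finance": "the bank raised interest rates on loans".split(),
    "river": "we walked along the bank of the river".split(),
}

# Surface-word view (what standard LDA sees): one vocabulary entry per string.
surface_counts = {name: Counter(tokens) for name, tokens in docs.items()}

# Hypothetical sense labels, hand-assigned purely for illustration.
# The paper induces senses without supervision instead of reading them from a table.
toy_senses = {
    "finance": {"bank": "bank#1"},
    "river": {"bank": "bank#2"},
}


def sense_tag(name, tokens):
    """Replace each token with its toy sense label, if one is defined."""
    mapping = toy_senses[name]
    return [mapping.get(tok, tok) for tok in tokens]


sense_counts = {
    name: Counter(sense_tag(name, tokens)) for name, tokens in docs.items()
}


def shared_vocab(counts):
    """Vocabulary entries that occur in both toy documents."""
    return set(counts["finance"]) & set(counts["river"])


print(sorted(shared_vocab(surface_counts)))  # surface words in common
print(sorted(shared_vocab(sense_counts)))    # sense-tagged entries in common
```

With surface words, the two documents overlap on "bank" as well as "the"; with sense-tagged tokens, only "the" remains shared, so a model working on that vocabulary has less spurious evidence for grouping the documents together.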

Author information

Authors and Affiliations

Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, China
Guoyu Tang, Yunqing Xia & Thomas Fang Zheng
Center for Speech and Language Technologies, Research Institute of Information Technology, China
Guoyu Tang, Yunqing Xia & Thomas Fang Zheng
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Guoyu Tang, Yunqing Xia & Thomas Fang Zheng
Institute for Infocomm Research, A-STAR, Singapore
Jun Sun
Soochow University, China
Min Zhang

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F., Mexico
Alexander Gelbukh

Cite this paper

Tang, G., Xia, Y., Sun, J., Zhang, M., Zheng, T.F. (2014). Topic Models Incorporating Statistical Word Senses. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_13

DOI: https://doi.org/10.1007/978-3-642-54906-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54905-2
Online ISBN: 978-3-642-54906-9
eBook Packages: Computer Science, Computer Science (R0)
