Introducing Semantics in Short Text Classification

Bouaziz, Ameni; da Costa Pereira, Célia; Dartigues-Pallez, Christel; Precioso, Frédéric

doi:10.1007/978-3-319-75487-1_34

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

To overcome the shortness and sparseness that hamper short text classification, an enrichment process is classically proposed: topics (word clusters) are extracted from external knowledge sources using Latent Dirichlet Allocation, and all the words of every topic that contains a word of the short text are added to the initial short text content. We propose (i) an explicit representation of a two-level enrichment method, in which enrichment is performed either with respect to each word in the text or with respect to the global semantic meaning of the short text, and (ii) a new kind of semantic Random Forest in which semantic relations between features are taken into account at node level, rather than at tree level as recently proposed in the literature, to avoid potential tree correlation. We demonstrate that our enrichment method is valid not only for Random Forest based methods but also for other methods such as MaxEnt, SVM and Naive Bayes.
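The word-level enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOPICS` dictionary, the `enrich` function, and the whitespace tokenisation are assumptions made for illustration, and the topics are hard-coded here rather than learned with Latent Dirichlet Allocation from an external knowledge source.

```python
from typing import Dict, List, Set

# Hypothetical topics. In the chapter these are word clusters learned with
# LDA from an external knowledge source; here they are hard-coded.
TOPICS: Dict[str, Set[str]] = {
    "sports": {"football", "goal", "match", "team"},
    "finance": {"bank", "loan", "market", "stock"},
}


def enrich(text: str, topics: Dict[str, Set[str]]) -> List[str]:
    """Append the words of every topic that shares a word with the text."""
    tokens = text.lower().split()
    original = set(tokens)   # only original words can trigger a topic
    enriched = list(tokens)
    present = set(tokens)    # avoids adding the same word twice
    for name in sorted(topics):          # fixed order for reproducibility
        if original & topics[name]:      # topic contains a word of the text
            for word in sorted(topics[name]):
                if word not in present:
                    enriched.append(word)
                    present.add(word)
    return enriched


print(enrich("team wins match", TOPICS))
# → ['team', 'wins', 'match', 'football', 'goal']
```

Only the original words of the text are allowed to trigger a topic in this sketch, so words added by one topic do not pull in further topics. Whether the chapter makes the same choice is not stated in the abstract.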

Acknowledgments

This work has been co-funded by Région Provence Alpes Côte d’Azur (PACA) and Semantic Grouping Company (SGC).

Author information

Authors and Affiliations

Laboratoire I3S (CNRS UMR-7271), Université Nice Sophia Antipolis, Nice, France
Ameni Bouaziz, Célia da Costa Pereira, Christel Dartigues-Pallez & Frédéric Precioso

Authors

Ameni Bouaziz
Célia da Costa Pereira
Christel Dartigues-Pallez
Frédéric Precioso

Corresponding author

Correspondence to Ameni Bouaziz.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Bouaziz, A., da Costa Pereira, C., Dartigues-Pallez, C., Precioso, F. (2018). Introducing Semantics in Short Text Classification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_34

DOI: https://doi.org/10.1007/978-3-319-75487-1_34
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)
