Abstract
Text clustering is used in various applications of text analysis. In the clustering process, the document representation method has a significant impact on the results. Some popular document representation methods cannot effectively preserve the proximity information of the documents or suffer from low interpretability. Although concept-based representation methods overcome these challenges to some extent, existing semi-supervised document clustering methods rarely use this type of representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield higher clustering quality. A concept is a set of semantically similar words. We propose the notion of semi-supervised concepts, which uses document labels to extract more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process uses unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated on the Reuters, 20-NewsGroups and WebKB text collections, and the results show that it outperforms several existing semi-supervised and unsupervised clustering approaches.
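The concept-based representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D embeddings, the seed words, the function names and the relative-frequency weighting are all assumptions made for illustration, and the label-guided (semi-supervised) concept extraction and the hierarchical clustering step are not covered. Words are grouped into concepts by cosine similarity in a spherical k-means style, and each document is then described by the share of its known words that falls into each concept.

```python
import math
from collections import Counter

# Hypothetical 2-D word embeddings (assumption); a real pipeline would
# learn embeddings from the corpus.
EMBEDDINGS = {
    "stock": (0.9, 0.1), "market": (0.8, 0.2), "trade": (0.95, 0.05),
    "goal": (0.1, 0.9), "match": (0.2, 0.8), "team": (0.05, 0.95),
}


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec)


def dot(u, v):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def nearest_centroid(unit_vec, centroids):
    """Index of the centroid with the highest cosine similarity."""
    return max(range(len(centroids)), key=lambda k: dot(unit_vec, centroids[k]))


def extract_concepts(embeddings, seed_words, iterations=10):
    """Group words into concepts (spherical k-means style, seeded by words)."""
    unit = {word: normalize(vec) for word, vec in embeddings.items()}
    centroids = [unit[seed] for seed in seed_words]
    assignment = {}
    for _ in range(iterations):
        assignment = {w: nearest_centroid(v, centroids) for w, v in unit.items()}
        for k in range(len(centroids)):
            members = [unit[w] for w, c in assignment.items() if c == k]
            if members:  # keep the old centroid if a concept becomes empty
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                centroids[k] = normalize(mean)
    return assignment


def concept_vector(tokens, assignment, n_concepts):
    """Represent a document by the relative frequency of each concept."""
    counts = Counter(assignment[t] for t in tokens if t in assignment)
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in range(n_concepts)]


concepts = extract_concepts(EMBEDDINGS, seed_words=["stock", "goal"])
doc_vector = concept_vector(["stock", "market", "goal", "the"], concepts, 2)
print(concepts)
print(doc_vector)
```

Each dimension of `doc_vector` corresponds to a concept whose member words can be listed, which is the sense in which such a representation is interpretable. Documents that use different but related words still receive similar vectors, which is one way the proximity of documents can be preserved.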
Data availability
The datasets used in this study are openly available at the following addresses:
Reuters-21578 at https://archive.ics.uci.edu/dataset/137/reuters+21578+text+categorization+collection, 20-Newsgroups at http://qwone.com/~jason/20Newsgroups/ and WebKB at http://www.cs.cmu.edu/~webkb/.
Author information
Contributions
Conceptualization: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Data curation: Seyed Mojtaba Sadjadi; Formal analysis: Seyed Mojtaba Sadjadi, Hoda Mashayekhi, Hamid Hassanpour; Investigation: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Methodology: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Project administration: Hoda Mashayekhi; Software: Seyed Mojtaba Sadjadi; Supervision: Hoda Mashayekhi, Hamid Hassanpour; Validation: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Visualization: Seyed Mojtaba Sadjadi, Hoda Mashayekhi, Hamid Hassanpour; Writing – original draft: Seyed Mojtaba Sadjadi; Writing – review & editing: Hoda Mashayekhi, Hamid Hassanpour.
Ethics declarations
Ethical approval and consent to participate
Not applicable.
Human and animal ethics
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Authors' information
Seyed Mojtaba Sadjadi received his M.S. from the Department of Computer Engineering at Shahrood University of Technology in 2021. His main research interests include machine learning, natural language processing, text mining through embedding, and the semantic web.
Hoda Mashayekhi is an Assistant Professor at the faculty of Computer Engineering, Shahrood University of Technology. She received her PhD from Sharif University of Technology in 2013. Her research interests include massive data mining, machine learning, parallel and distributed computing, decision-making, and recommendation systems.
Hamid Hassanpour received his Ph.D. from the Queensland University of Technology, Australia, in 2004. He is currently a full professor at the Faculty of Computer Engineering, Shahrood University of Technology, Iran. His research interests include image processing, signal processing, and data mining. He has published over 200 journal and conference papers. He is the Editor-in-Chief of the Journal of Artificial Intelligence and Data Mining.
About this article
Cite this article
Sadjadi, S.M., Mashayekhi, H. & Hassanpour, H. A semi-supervised framework for concept-based hierarchical document clustering. World Wide Web 26, 3861–3890 (2023). https://doi.org/10.1007/s11280-023-01209-4
DOI: https://doi.org/10.1007/s11280-023-01209-4