A semi-supervised framework for concept-based hierarchical document clustering

Sadjadi, Seyed Mojtaba; Mashayekhi, Hoda; Hassanpour, Hamid

doi:10.1007/s11280-023-01209-4

Published: 02 October 2023

World Wide Web, Volume 26, pages 3861–3890 (2023)

Seyed Mojtaba Sadjadi¹,
Hoda Mashayekhi¹ &
Hamid Hassanpour¹

Abstract

Text clustering is used in various applications of text analysis. In the clustering process, the employed document representation method has a significant impact on the results. Some popular document representation methods cannot effectively maintain the proximity information of the documents or suffer from low interpretability. Although the concept-based representation methods overcome these challenges to some extent, the existing semi-supervised document clustering methods rarely use this type of document representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield a higher clustering quality. Concepts are composed of a set of semantically similar words. We propose the notion of semi-supervised concepts to benefit from document labels in extracting more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process utilizes unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated using the Reuters, 20-NewsGroups and WebKB text collections, and the results reveal the superiority of the proposed framework compared to several existing semi-supervised and unsupervised clustering approaches.
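
The idea of representing documents through concepts built from embedded words can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D embeddings, the seed words, and the helper names `extract_concepts` and `concept_vector` are all assumptions made for illustration, and plain normalized concept frequency stands in for the concept weighting and the label-guided (semi-supervised) concept extraction that the paper proposes.

```python
import math
from collections import Counter

# Toy 2-D word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "stock": (0.9, 0.1), "market": (0.8, 0.2), "trade": (0.85, 0.15),
    "goal": (0.1, 0.9), "match": (0.2, 0.8), "team": (0.15, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def extract_concepts(embeddings, seeds, iterations=10):
    """Group words into concepts with a small cosine-based k-means.

    Each seed word initializes one centroid; a concept is the set of
    words assigned to the same centroid.
    """
    centroids = [embeddings[s] for s in seeds]
    assignment = {}
    for _ in range(iterations):
        # Assign each word to its most similar centroid.
        assignment = {
            w: max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for w, vec in embeddings.items()
        }
        # Recompute each centroid as the mean of its member vectors.
        for k in range(len(centroids)):
            members = [embeddings[w] for w, c in assignment.items() if c == k]
            if members:
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment


def concept_vector(tokens, assignment, n_concepts):
    """Represent a document as normalized concept frequencies."""
    counts = Counter(assignment[t] for t in tokens if t in assignment)
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in range(n_concepts)]


word_to_concept = extract_concepts(EMBEDDINGS, seeds=["stock", "goal"])
doc = "the team won the match after a late goal".split()
print(concept_vector(doc, word_to_concept, n_concepts=2))  # [0.0, 1.0]
```

Each dimension of the resulting vector corresponds to a readable group of words, which is what makes a concept-based representation interpretable; a clustering step would then operate on these vectors rather than on raw term counts.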

Data availability

The datasets used in this study are openly available at the following addresses:

Reuters-21578 at [https://archive.ics.uci.edu/dataset/137/reuters+21578+text+categorization+collection], 20-Newsgroups at [http://qwone.com/~jason/20Newsgroups/] and WebKB at [http://www.cs.cmu.edu/~webkb/].

Author information

Authors and Affiliations

Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran
Seyed Mojtaba Sadjadi, Hoda Mashayekhi & Hamid Hassanpour

Contributions

Conceptualization: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Data curation: Seyed Mojtaba Sadjadi; Formal analysis: Seyed Mojtaba Sadjadi, Hoda Mashayekhi, Hamid Hassanpour; Investigation: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Methodology: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Project administration: Hoda Mashayekhi; Software: Seyed Mojtaba Sadjadi; Supervision: Hoda Mashayekhi, Hamid Hassanpour; Validation: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Visualization: Seyed Mojtaba Sadjadi, Hoda Mashayekhi, Hamid Hassanpour; Writing – original draft: Seyed Mojtaba Sadjadi; Writing – review & editing: Hoda Mashayekhi, Hamid Hassanpour.

Corresponding author

Correspondence to Hoda Mashayekhi.

Ethics declarations

Ethical approval and consent to participate

Not applicable.

Human and animal ethics

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Authors' information

Seyed Mojtaba Sadjadi received his M.S. from the Department of Computer Engineering at Shahrood University of Technology in 2021. His main research interests include machine learning, natural language processing, text mining through embedding, and the semantic web.

Hoda Mashayekhi is an Assistant Professor at the Faculty of Computer Engineering, Shahrood University of Technology. She received her Ph.D. from Sharif University of Technology in 2013. Her research interests include massive data mining, machine learning, parallel and distributed computing, decision-making, and recommendation systems.

Hamid Hassanpour received his Ph.D. from the Queensland University of Technology, Australia, in 2004. He is currently a full professor at the Faculty of Computer Engineering, Shahrood University of Technology, Iran. His research interests include image processing, signal processing, and data mining. He has published over 200 journal and conference papers. He is the Editor-in-Chief of the Journal of Artificial Intelligence and Data Mining.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sadjadi, S.M., Mashayekhi, H. & Hassanpour, H. A semi-supervised framework for concept-based hierarchical document clustering. World Wide Web 26, 3861–3890 (2023). https://doi.org/10.1007/s11280-023-01209-4

Received: 30 July 2022
Revised: 21 August 2023
Accepted: 30 August 2023
Published: 02 October 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s11280-023-01209-4
