A semi-supervised framework for concept-based hierarchical document clustering

World Wide Web

Abstract

Text clustering is used in various applications of text analysis. In the clustering process, the employed document representation method has a significant impact on the results. Some popular document representation methods cannot effectively maintain the proximity information of the documents or suffer from low interpretability. Although the concept-based representation methods overcome these challenges to some extent, the existing semi-supervised document clustering methods rarely use this type of document representation. In this paper, we propose a concept-based semi-supervised framework for document clustering that uses both labeled and unlabeled data to yield a higher clustering quality. Concepts are composed of a set of semantically similar words. We propose the notion of semi-supervised concepts to benefit from document labels in extracting more relevant concepts. We also propose a new method of clustering documents based on the weights of such concepts. In the first and second steps of the proposed framework, the documents are represented based on the concepts extracted from the set of embedded words in the corpus. The proposed representation is interpretable and preserves the proximity information of documents. In the third step, the semi-supervised hierarchical clustering process utilizes unlabeled data to capture the overall structure of the clusters, and the supervision of a small number of labeled documents to adjust the cluster centroids. The use of concept vectors improves the process of merging clusters in the hierarchical clustering approach. The proposed framework is evaluated using the Reuters, 20-NewsGroups and WebKB text collections, and the results reveal the superiority of the proposed framework compared to several existing semi-supervised and unsupervised clustering approaches.
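The three steps outlined in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the word clustering algorithm, the concept weighting scheme, or how labels steer the hierarchy. The sketch assumes a spherical k-means pass with farthest-first initialisation to form concepts, normalised concept frequencies as document weights, and centroid-linkage merging in which clusters carrying different known labels are never merged. All function names, the toy vocabulary, and the two-dimensional word vectors are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale rows (or a single vector) to unit L2 norm; zero vectors stay zero."""
    n = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.where(n == 0, 1.0, n)

def extract_concepts(word_vecs, k, iters=10):
    """Steps 1-2a (assumed): group word vectors into k concepts by cosine similarity."""
    X = unit(np.asarray(word_vecs, dtype=float))
    centers = [X[0]]                      # deterministic farthest-first seeding
    while len(centers) < k:
        sims = np.max(X @ np.array(centers).T, axis=1)
        centers.append(X[np.argmin(sims)])
    centers = np.array(centers)
    for _ in range(iters):
        assign = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = unit(X[assign == j].sum(axis=0))
    return np.argmax(X @ centers.T, axis=1)   # concept id per word

def concept_vectors(docs, vocab, word_concept, k):
    """Step 2b (assumed): represent each document by normalised concept counts."""
    index = {w: i for i, w in enumerate(vocab)}
    D = np.zeros((len(docs), k))
    for d, tokens in enumerate(docs):
        for tok in tokens:
            if tok in index:
                D[d, word_concept[index[tok]]] += 1.0
    return unit(D)

def constrained_agglomerative(D, labels, n_clusters):
    """Step 3 (assumed): centroid-linkage merging with label conflicts blocked.

    labels[i] is a class id for a labeled document and None otherwise.
    """
    clusters = [{"members": [i], "label": labels[i]} for i in range(len(D))]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                la, lb = clusters[a]["label"], clusters[b]["label"]
                if la is not None and lb is not None and la != lb:
                    continue              # conflicting supervision: do not merge
                ca = unit(D[clusters[a]["members"]].mean(axis=0))
                cb = unit(D[clusters[b]["members"]].mean(axis=0))
                sim = float(ca @ cb)
                if sim > best:
                    best, pair = sim, (a, b)
        if pair is None:
            break                         # no admissible merge remains
        a, b = pair
        la, lb = clusters[a]["label"], clusters[b]["label"]
        merged = {"members": clusters[a]["members"] + clusters[b]["members"],
                  "label": la if la is not None else lb}
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    out = np.empty(len(D), dtype=int)
    for cid, c in enumerate(clusters):
        out[c["members"]] = cid
    return out

# Toy data (hypothetical): two-dimensional stand-ins for word embeddings.
vocab = ["goal", "match", "team", "stock", "market", "bank"]
word_vecs = [[1, 0.1], [0.9, 0.2], [1, 0], [0.1, 1], [0, 0.9], [0.2, 1]]
docs = [["goal", "team", "match"], ["stock", "bank"],
        ["team", "goal", "market"], ["market", "bank", "stock", "match"]]
labels = [0, 1, None, None]               # only the first two documents are labeled

word_concept = extract_concepts(word_vecs, k=2)
doc_vecs = concept_vectors(docs, vocab, word_concept, k=2)
assignment = constrained_agglomerative(doc_vecs, labels, n_clusters=2)
print(assignment)
```

On this toy input the two unlabeled documents join the labeled document whose concept profile they share, so documents 0 and 2 form one cluster and documents 1 and 3 the other. Each dimension of a document vector corresponds to a readable group of words, which is the interpretability property the abstract attributes to concept-based representations.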

Data availability

The datasets used in this study are openly available at the following addresses:

Reuters-21578 at [https://archive.ics.uci.edu/dataset/137/reuters+21578+text+categorization+collection], 20-Newsgroups at [http://qwone.com/~jason/20Newsgroups/] and WebKB at [http://www.cs.cmu.edu/~webkb/].

Author information

Contributions

Conceptualization: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Data curation: Seyed Mojtaba Sadjadi; Formal analysis: Seyed Mojtaba Sadjadi, Hoda Mashayekhi, Hamid Hassanpour; Investigation: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Methodology: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Project administration: Hoda Mashayekhi; Software: Seyed Mojtaba Sadjadi; Supervision: Hoda Mashayekhi, Hamid Hassanpour; Validation: Seyed Mojtaba Sadjadi, Hoda Mashayekhi; Visualization: Seyed Mojtaba Sadjadi, Hoda Mashayekhi, Hamid Hassanpour; Writing – original draft: Seyed Mojtaba Sadjadi; Writing – review & editing: Hoda Mashayekhi, Hamid Hassanpour.

Corresponding author

Correspondence to Hoda Mashayekhi.

Ethics declarations

Ethical approval and consent to participate

Not applicable.

Human and animal ethics

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Authors' information

Seyed Mojtaba Sadjadi received his M.S. from the Department of Computer Engineering at Shahrood University of Technology in 2021. His main research interests include machine learning, natural language processing, text mining through embedding, and the semantic web.

Hoda Mashayekhi is an Assistant Professor at the faculty of Computer Engineering, Shahrood University of Technology. She received her PhD from Sharif University of Technology in 2013. Her research interests include massive data mining, machine learning, parallel and distributed computing, decision-making, and recommendation systems.

Hamid Hassanpour received his Ph.D. from the Queensland University of Technology, Australia, in 2004. He is currently a full professor at the Faculty of Computer Engineering, Shahrood University of Technology, Iran. His research interests include image processing, signal processing, and data mining. He has published over 200 journal and conference papers. He is the Editor-in-Chief of the Journal of Artificial Intelligence and Data Mining.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sadjadi, S.M., Mashayekhi, H. & Hassanpour, H. A semi-supervised framework for concept-based hierarchical document clustering. World Wide Web 26, 3861–3890 (2023). https://doi.org/10.1007/s11280-023-01209-4

  • DOI: https://doi.org/10.1007/s11280-023-01209-4
