Abstract
Electronic health records (EHRs) involve heterogeneous data types such as binary, numeric and categorical attributes. As traditional clustering approaches require the definition of a single proximity measure, different data types are typically transformed into a common format or amalgamated through a single distance function. Unfortunately, this early transformation step largely pre-determines the cluster analysis results and can cause information loss, as the relative importance of different attributes is not considered. This exploratory work aims to avoid this premature integration of attribute types prior to cluster analysis through a multi-objective evolutionary algorithm called MVMC. This approach allows multiple data types to be integrated into the clustering process, trade-offs between them to be explored, and consensus clusters supported across these data views to be determined. We evaluate our approach in a case study focusing on systemic sclerosis (SSc), a highly heterogeneous auto-immune disease that can be considered a representative example of an EHR data problem. Our results highlight the potential benefits of multi-view learning in an EHR context. Furthermore, this comprehensive classification, which integrates multiple and varied data sources, will help to better understand disease complications and treatment goals.
Notes
- 1. SSc patients in the Internal Medicine Department of University Hospital of Lille, France, between October 2014 and December 2021 as part of the FHU PRECISE project (PREcision health in Complex Immune-mediated inflammatory diseaSEs); sample collection and usage authorization, CPP 2019-A01083-54.
- 2. Note that the Silhouette score is intended to compare different partitions produced by a single method. Usually, the Rand index is preferred to the Silhouette score to compare two solutions when a ground-truth partition is available [35].
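To make the distinction in Note 2 concrete, the following is a minimal sketch of the plain (unadjusted) Rand index, which scores a candidate partition against a reference partition by counting pairs of objects on which the two agree. The function name `rand_index` and the toy label vectors are illustrative assumptions and are not taken from the chapter.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Plain Rand index between two partitions of the same objects.

    A pair of objects counts as an agreement when both partitions put the
    pair in the same cluster, or both put it in different clusters.
    (Hypothetical helper written for illustration only.)
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two objects are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: the index ignores how clusters are named.
reference = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same grouping, different cluster names
crossed = [0, 1, 0, 1]      # splits every reference cluster
print(rand_index(reference, relabelled))
print(rand_index(reference, crossed))
```

Unlike the Silhouette score, this measure needs a reference labelling and no distance function, which is why it is the usual choice when a ground-truth partition is available.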
References
Abdullin, A., Nasraoui, O.: Clustering heterogeneous data sets. In: American Web Congress, pp. 1–8. IEEE (2012)
Ahmad, A., Dey, L.: A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 63(2), 503–527 (2007)
Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)
Ahmad, A., Khan, S.S.: initKmix-a novel initial partition generation algorithm for clustering mixed data using k-means-based clustering. Expert Syst. Appl. 167, 114149 (2021)
Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., Cremers, D.: Clustering with deep learning: taxonomy and new methods (2018). arXiv:1801.07648
Banfield, J.D., Raftery, A.E.: Model-based gaussian and non-gaussian clustering. Biometrics 49(3), 803–821 (1993)
Basel, A.J., Rui, F., Nandi, K.A.: Integrative cluster analysis in bioinformatics. John Wiley & Sons, USA (2015)
Bécue-Bertaut, M., Pagès, J.: Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Comput. Stat. Data Anal. 52(6), 3255–3268 (2008)
Ben Ali, B., Massmoudi, Y.: K-means clustering based on gower similarity coefficient: a comparative study. In: International Conference on Modeling, Simulation and Applied Optimization (ICMSAO), pp. 1–5. IEEE (2013)
Budiaji, W., Leisch, F.: Simple k-medoids partitioning algorithm for mixed variable data. Algorithms 12(9), 177 (2019)
de Carvalho, F., Lechevallier, Y., de Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recogn. 45(1), 447–464 (2012)
de Carvalho, F.D.A., Lechevallier, Y., de Melo, F.M.: Partitioning hard clustering algorithms based on multiple dissimilarity matrices. Pattern Recogn. 45(1), 447–464 (2012)
Chiu, T., Fang, D., Chen, J., Wang, Y., Jeris, C.: A robust and scalable clustering algorithm for mixed type attributes in large database environment. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), pp. 263–268. Association for Computing Machinery, New York, NY, USA (2001)
de Carvalho, F., Lechevallier, Y., Despeyroux, T., de Melo, F.M.: Advances in knowledge discovery and management. In: Zighed, F., Abdelkader, G., Gilles, P., Venturini, B.D. (eds.) Multi-view Clustering on Relational Data, pp. 37–51. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-02999-3_3
Foss, A.H., Markatou, M., Ray, B.: Distance metrics and clustering methods for mixed-type data. Int. Stat. Rev. 87(1), 80–109 (2019)
Fraley, C., Raftery, A.E.: How many clusters? which clustering method? answers via model-based cluster analysis. Comput. J. 41(8), 578–588 (1998)
Green, P.E., Rao, V.R.: A note on proximity measures and cluster analysis. J. Mark. Res. 3(6), 359–364 (1969)
Harikumar, S., Surya, P.V.: K-medoid clustering for heterogeneous datasets. Procedia Comput. Sci. 70, 226–237 (2015)
Hsu, C.C., Chen, C.L., Su, Y.W.: Hierarchical clustering of mixed data based on distance hierarchy. Inf. Sci. 177(20), 4474–4492 (2007)
Huang, J., Ng, M., Rong, H., Li, Z.: Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 657–668 (2005)
Huang, Z.: Clustering large data sets with mixed numeric and categorical values. In: The Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21–34 (1997)
Hunt, L., Jorgensen, M.: Clustering mixed data. WIREs Data Min. Knowl. Disc. 1(4), 352–361 (2011)
José-García, A., Gómez-Flores, W.: Automatic clustering using nature-inspired metaheuristics: a survey. Appl. Soft Comput. 41, 192–213 (2016)
José-García, A., Gómez-Flores, W.: A survey of cluster validity indices for automatic data clustering using differential evolution. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 314–322. ACM Press (2021). https://doi.org/10.1145/3449639.3459341
José-García, A., Handl, J.: On the interaction between distance functions and clustering criteria in multi-objective clustering. In: Ishibuchi, H., Zhang, Q., Cheng, R., Li, K., Li, H., Wang, H., Zhou, A. (eds.) EMO 2021. LNCS, vol. 12654, pp. 504–515. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72062-9_40
José-García, A., Handl, J., Gómez-Flores, W., Garza-Fabre, M.: Many-view clustering: an illustration using multiple dissimilarity measures. In: Genetic and Evolutionary Computation Conference - GECCO 2019, pp. 213–214. ACM Press, Prague, Czech Republic (2019)
José-García, A., Handl, J., Gómez-Flores, W., Garza-Fabre, M.: An evolutionary many-objective approach to multiview clustering using feature and relational data. Appl. Soft Comput. 108, 107425 (2021)
Landi, I., et al.: Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digital Med. 3(1), 96 (2020)
Li, C., Biswas, G.: Unsupervised learning with mixed numeric and nominal data. IEEE Trans. Knowl. Data Eng. 14(4), 673–690 (2002)
Liu, C., Chen, Q., Chen, Y., Liu, J.: A fast multiobjective fuzzy clustering with multimeasures combination. Math. Prob. Eng. 2019, 1–21 (2019)
Liu, C., Liu, J., Peng, D., Wu, C.: A general multiobjective clustering approach based on multiple distance measures. IEEE Access 6, 41706–41719 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press (1967)
Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y.: A comparison study on similarity and dissimilarity measures in clustering continuous data. PLOS ONE 10(12), e0144059 (2015)
Sobanski, V., Giovannelli, J., Allanore, Y., et al.: Phenotypes determined by cluster analysis and their survival in the prospective european scleroderma trials and research cohort of patients with systemic sclerosis. Arthritis Rheumatol. 71(9), 1553–1570 (2019)
Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Elsevier Inc., Amsterdam (2009)
Vandromme, M., Jacques, J., Taillard, J., Jourdan, L., Dhaenens, C.: A biclustering method for heterogeneous and temporal medical data. IEEE Trans. Knowl. Data Eng. 34(2), 506–518 (2022)
van de Velden, M., Iodice D’Enza, A., Markos, A.: Distance-based clustering of mixed data. WIREs Comput. Stat. 11(3), e1456 (2019)
Wei, M., Chow, T., Chan, R.: Clustering heterogeneous data with k-means by mutual information-based unsupervised feature transformation. Entropy 17(3), 1535–1548 (2015)
Acknowledgments
The authors are grateful to the University of Lille, CHU Lille, and INSERM, funded by the MEL through the I-Site cluster humAIn@Lille.
Appendix
This Appendix includes figures complementing the results of the experiments presented in Sect. 5. From Fig. 5 (Appendix), it is clear that the determined number of clusters is three, as the Silhouette index attains its highest value at \(k=3\). Also, from the Pareto front approximations obtained by these configurations, a substantial influence of the {Num} view is observed over the {Bin} and {Gower} views, respectively. Accordingly, the clustering solutions and the weighted embedding space are remarkably similar between these two data-view configurations.
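The selection of the number of clusters by the Silhouette index can be illustrated with a minimal sketch. It assumes a precomputed dissimilarity matrix and one candidate partition per value of \(k\); the helper names `mean_silhouette` and `select_k`, as well as the toy data, are hypothetical and do not reproduce the chapter's experiments or the MVMC algorithm.

```python
def mean_silhouette(dist, labels):
    """Mean Silhouette width from a precomputed dissimilarity matrix.

    dist[i][j] is the dissimilarity between objects i and j.
    Objects in singleton clusters get a width of 0, the usual convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette index needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton: contributes 0
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def select_k(dist, partitions_by_k):
    """Return the k whose partition has the highest mean Silhouette width."""
    return max(partitions_by_k, key=lambda k: mean_silhouette(dist, partitions_by_k[k]))


# Toy data (assumption): six points on a line forming three tight groups.
points = [0.0, 0.1, 5.0, 5.1, 10.0, 10.1]
dist = [[abs(p - q) for q in points] for p in points]
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}
best_k = select_k(dist, candidates)
print(best_k)
```

Working from a dissimilarity matrix rather than raw features keeps the sketch agnostic to the data view, so the same routine could in principle be applied to a numeric, binary, or Gower-type dissimilarity.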
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
José-García, A. et al. (2022). Multi-view Clustering of Heterogeneous Health Data: Application to Systemic Sclerosis. In: Rudolph, G., Kononova, A.V., Aguirre, H., Kerschke, P., Ochoa, G., Tušar, T. (eds) Parallel Problem Solving from Nature – PPSN XVII. PPSN 2022. Lecture Notes in Computer Science, vol 13399. Springer, Cham. https://doi.org/10.1007/978-3-031-14721-0_25
DOI: https://doi.org/10.1007/978-3-031-14721-0_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14720-3
Online ISBN: 978-3-031-14721-0
eBook Packages: Computer Science, Computer Science (R0)