Abstract
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is error-prone and impossible to perform manually. Many computational methods have been developed to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene expression to protein domains. Performance was judged on the basis of 13 common cluster validity indices. We developed a clustering analysis platform, ClustEval (http://clusteval.mpi-inf.mpg.de), to promote streamlined evaluation, comparison and reproducibility of clustering results in the future. This allowed us to objectively evaluate the performance of all tools on all data sets with up to 1,000 different parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We observed that there was no universal best performer, but on the basis of this wide-ranging comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to pick the appropriate tool for their data type and allows method developers to compare their tool to the state of the art.
Acknowledgements
C.W. is supported by the SDU2020 funding initiative at the University of Southern Denmark. R.R. was partially supported by the International Max Planck Research School on Computer Science and the Saarland University Graduate School for Computer Science. J.B. is grateful for financial support from the Cluster of Excellence for Multimodal Computing and Interaction (MMCI).
Author information
Contributions
C.W. implemented ClustEval and performed the study. J.B. and R.R. jointly directed this work and designed the study. All authors contributed equally to the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Performance of all clustering tools on all data sets.
See Table 1 in the main text for definitions of methods’ abbreviations. Empty fields indicate that the corresponding tool could not cluster the data set or that a cluster validity index could not be computed. This happens when a tool needs feature vectors for the objects but the data set is given as a similarity matrix, or when the silhouette value is undefined (indicated with an asterisk) because the clustering consists of only singletons or of a single cluster.
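The undefined silhouette cases named in this caption can be illustrated with a minimal sketch. It assumes Euclidean distances on feature vectors and the usual Rousseeuw definition of the silhouette; the function name `mean_silhouette` and the choice to return `None` for the undefined cases are illustrative assumptions, not part of ClustEval.

```python
from math import dist
from typing import Optional, Sequence


def mean_silhouette(points: Sequence[Sequence[float]],
                    labels: Sequence[int]) -> Optional[float]:
    """Mean silhouette value, or None when it is undefined (hypothetical helper)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    # Undefined cases from the caption: a single cluster, or only singletons.
    if len(clusters) < 2 or len(clusters) == len(points):
        return None

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes 0 by Rousseeuw's convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(mean_silhouette(pts, [0, 0, 1, 1]))  # two well-separated clusters
    print(mean_silhouette(pts, [0, 0, 0, 0]))  # one cluster: undefined
    print(mean_silhouette(pts, [0, 1, 2, 3]))  # only singletons: undefined
```

Returning `None` rather than a number keeps an undefined result distinguishable from a genuinely poor score, which mirrors the asterisk used in the figure.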
Supplementary Figure 2 Robustness analysis of all clustering methods.
Robustness of all clustering methods on five selected data sets reported as mean F1 scores over ten repetitions. For the two biomedical data sets (astral1_161 and bone_marrow) the noise levels are 5% (low) and 10% (high). For the three synthetic data sets, we report the performance on higher noise levels: 20% (low) and 40% (high). See Table 1 for definitions of methods’ abbreviations. Empty fields indicate that the corresponding tool could not cluster the data set, which happens when a tool needs feature vectors for the objects but the data set is given as a similarity matrix.
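The reporting scheme in this caption (a mean F1 score over ten noisy repetitions) can be sketched as follows. The caption does not say how noise is added or which F1 variant is used, so both are assumptions here: the sketch uses a pair-counting F1 against a gold standard, and the clustering function and noise function are supplied by the caller. The names `pair_f1` and `mean_f1_under_noise` are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def pair_f1(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Pair-counting F1: a pair is positive if both objects share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1_under_noise(cluster: Callable, add_noise: Callable, data, gold,
                        noise_level: float, repetitions: int = 10,
                        seed: int = 0) -> float:
    """Average F1 over repeated runs on independently perturbed copies of data."""
    rng = random.Random(seed)  # fixed seed keeps the repetitions reproducible
    scores = []
    for _ in range(repetitions):
        noisy = add_noise(data, noise_level, rng)  # assumed noise model
        scores.append(pair_f1(cluster(noisy), gold))
    return mean(scores)


if __name__ == "__main__":
    # Toy one-dimensional example: threshold "clustering" and uniform jitter.
    data = [1.0, 2.0, 8.0, 9.0]
    gold = [0, 0, 1, 1]
    threshold = lambda xs: [0 if x < 5 else 1 for x in xs]
    jitter = lambda xs, level, rng: [x + rng.uniform(-level, level) * 10 for x in xs]
    for level in (0.05, 0.10):
        print(level, mean_f1_under_noise(threshold, jitter, data, gold, level))
```

Passing the noise model in as a function keeps the averaging logic independent of how a given data type is perturbed.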
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–2 and Supplementary Note (PDF 2855 kb)
About this article
Cite this article
Wiwie, C., Baumbach, J. & Röttger, R. Comparing the performance of biomedical clustering methods. Nat Methods 12, 1033–1038 (2015). https://doi.org/10.1038/nmeth.3583
DOI: https://doi.org/10.1038/nmeth.3583