Comparing the performance of biomedical clustering methods

Wiwie, Christian; Baumbach, Jan; Röttger, Richard

doi:10.1038/nmeth.3583

Analysis
Published: 21 September 2015

Nature Methods volume 12, pages 1033–1038 (2015)

Abstract

Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is error-prone and impossible to perform manually. Many computational methods have been developed to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene expression to protein domains. Performance was judged on the basis of 13 common cluster validity indices. We developed a clustering analysis platform, ClustEval (http://clusteval.mpi-inf.mpg.de), to promote streamlined evaluation, comparison and reproducibility of clustering results in the future. This allowed us to objectively evaluate the performance of all tools on all data sets with up to 1,000 different parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We observed that there was no universal best performer, but on the basis of this wide-ranging comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to pick the appropriate tool for their data type and allows method developers to compare their tool to the state of the art.
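As a rough illustration of how one tool could be evaluated over several parameter sets, the sketch below scores each resulting clustering against a gold standard and keeps the best-scoring parameter set. It is not ClustEval code. The abstract does not say which F1 variant was used, so a pair-counting F1 is assumed here, and `pair_f1`, `best_parameter_set`, `gap_clustering` and the toy data are hypothetical names introduced only for this example.

```python
from itertools import combinations


def pair_f1(predicted, gold):
    """Pair-counting F1 between two flat clusterings given as label lists.

    Assumption: a pair of objects is a true positive when both the
    predicted clustering and the gold standard put it in the same cluster.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_parameter_set(run_tool, parameter_sets, gold):
    """Return (score, params) for the parameter set with the highest F1."""
    scored = [(pair_f1(run_tool(p), gold), p) for p in parameter_sets]
    return max(scored, key=lambda pair: pair[0])


# Toy data (hypothetical): six sorted 1-D values and their gold-standard classes.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
gold = [0, 0, 0, 1, 1, 2]


def gap_clustering(max_gap):
    """Toy 'tool': open a new cluster when the gap to the previous value exceeds max_gap."""
    labels, current = [0], 0
    for prev, cur in zip(values, values[1:]):
        if cur - prev > max_gap:
            current += 1
        labels.append(current)
    return labels


best_score, best_gap = best_parameter_set(gap_clustering, [0.05, 1.0, 4.0, 10.0], gold)
print(best_score, best_gap)
```

In this toy run the smallest threshold yields only singletons and the largest yields one cluster, so both extremes score poorly. That is the kind of parameter sensitivity a sweep over many parameter sets is meant to expose.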

**Figure 1: Performance of all clustering tools on all nonartificial data sets on the basis of F1 scores.**

**Figure 2: Correlations between internal and external cluster validity indices for all biomedical data sets.**

Acknowledgements

C.W. is supported by the SDU2020 funding initiative at the University of Southern Denmark. R.R. was partially supported by the International Max Planck Research School on Computer Science and the Saarland University Graduate School for Computer Science. J.B. is grateful for financial support from the Cluster of Excellence for Multimodal Computing and Interaction (MMCI).

Author information

Jan Baumbach and Richard Röttger: These authors jointly supervised this work.

Authors and Affiliations

Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
Christian Wiwie, Jan Baumbach & Richard Röttger
Computational Systems Biology, Max Planck Institute for Informatics, Saarbrücken, Germany
Jan Baumbach
Cluster of Excellence for Multimodal Computing and Interaction, Saarland University, Saarbrücken, Germany
Jan Baumbach

Contributions

C.W. implemented ClustEval and performed the study. J.B. and R.R. jointly directed this work and designed the study. All authors contributed equally to the manuscript.

Corresponding author

Correspondence to Jan Baumbach.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Performance of all clustering tools on all data sets.

See Table 1 in the main text for definitions of methods’ abbreviations. Empty fields indicate that the corresponding tool could not cluster the data set or that a cluster validity index could not be computed. The first case occurs when a tool needs feature vectors for the objects but the data set is given as a similarity matrix. The second occurs when the silhouette value is undefined (indicated with an asterisk) because the clustering consists of only singletons or of only one cluster.
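The two undefined cases can be seen directly in a small silhouette computation. The following is a minimal sketch using Rousseeuw's silhouette width with Euclidean distances, not the ClustEval implementation. The function name `silhouette_or_none`, the choice to return `None` for undefined results, and the toy points are assumptions made for this example.

```python
from math import dist
from statistics import mean


def silhouette_or_none(points, labels):
    """Mean silhouette width, or None when the value is undefined."""
    n = len(points)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    # Undefined cases: a single cluster has no neighbouring cluster to
    # compare against, and an all-singleton clustering has no
    # within-cluster distances.
    if len(clusters) < 2 or len(clusters) == n:
        return None

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton inside a mixed clustering scores 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_or_none(points, [0, 0, 1, 1]))  # two well-separated clusters
print(silhouette_or_none(points, [0, 0, 0, 0]))  # one cluster: undefined
print(silhouette_or_none(points, [0, 1, 2, 3]))  # only singletons: undefined
```

Returning `None` rather than a number keeps undefined results distinguishable from genuinely poor scores, which matches the asterisk marking described in the caption.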

Supplementary Figure 2 Robustness analysis of all clustering methods.

Robustness of all clustering methods on five selected data sets, reported as mean F1 scores over ten repetitions. For the two biomedical data sets (astral1_161 and bone_marrow) the noise levels are 5% (low) and 10% (high). For the three synthetic data sets, we report the performance on higher noise levels: 20% (low) and 40% (high). See Table 1 for definitions of methods’ abbreviations. Empty fields indicate that the corresponding tool could not cluster the data set, which happens when a tool needs feature vectors for the objects but the data set is given as a similarity matrix.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–2 and Supplementary Note (PDF 2855 kb)

About this article

Cite this article

Wiwie, C., Baumbach, J. & Röttger, R. Comparing the performance of biomedical clustering methods. Nat Methods 12, 1033–1038 (2015). https://doi.org/10.1038/nmeth.3583

Received: 12 March 2015
Accepted: 24 July 2015
Published: 21 September 2015
Issue Date: November 2015
DOI: https://doi.org/10.1038/nmeth.3583
