A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced positioning

Tan, Meng Piao; Broach, James R.; Floudas, Christodoulos A.

doi:10.1007/s10898-007-9140-6

Original Paper
Published: 12 April 2007

Volume 39, pages 323–346, (2007)
Journal of Global Optimization

Abstract

Cluster analysis of genome-wide expression data from DNA microarray hybridization studies is a useful tool for identifying biologically relevant gene groupings (DeRisi et al. 1997; Weiler et al. 1997). It is hence important to apply a rigorous yet intuitive clustering algorithm to uncover these genomic relationships. In this study, we describe a novel clustering algorithm framework based on a variant of the Generalized Benders Decomposition, denoted as the Global Optimum Search (Floudas et al. 1989; Floudas 1995), which includes a procedure to determine the optimal number of clusters to be used. The approach involves a pre-clustering of data points to define an initial number of clusters, followed by the iterative solution of a Linear Programming problem (the primal problem) and a Mixed-Integer Linear Programming problem (the master problem), both derived from a Mixed-Integer Nonlinear Programming formulation. Badly placed data points are removed to form new clusters, thus ensuring tight groupings amongst the data points and incrementing the number of clusters until the optimum number is reached. We apply the proposed clustering algorithm to experimental DNA microarray data centered on the Ras signaling pathway in the yeast Saccharomyces cerevisiae and compare the results to those obtained with some commonly used clustering algorithms. Our algorithm compares favorably against these algorithms in terms of intra-cluster similarity and inter-cluster dissimilarity, often considered two key tenets of clustering. Furthermore, our algorithm can predict the optimal number of clusters, and the biological coherence of the predicted clusters is analyzed through gene ontology.
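
The two comparison criteria named in the abstract, intra-cluster similarity and inter-cluster dissimilarity, can be illustrated with a minimal Python sketch. The abstract does not give the exact measures used in the paper, so the definitions below are assumptions chosen for illustration: squared Euclidean distance, centroids as cluster representatives, and the hypothetical helper names `intra_cluster_error` and `inter_cluster_error`. A tight clustering has a small intra-cluster sum, and a well-separated one has a large inter-cluster sum.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intra_cluster_error(clusters: Dict[int, List[Point]]) -> float:
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sq_dist(p, c) for p in members)
    return total


def inter_cluster_error(clusters: Dict[int, List[Point]]) -> float:
    """Sum of squared distances between every pair of cluster centroids."""
    centers = [centroid(m) for m in clusters.values()]
    return sum(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )


# Toy data (made up for illustration): two tight, well-separated clusters.
toy = {
    0: [(0.0, 0.0), (0.0, 2.0)],
    1: [(10.0, 0.0), (10.0, 2.0)],
}
intra = intra_cluster_error(toy)  # small value: points sit close to their centroids
inter = inter_cluster_error(toy)  # large value: centroids are far apart
```

Under these assumed definitions, moving a badly placed point out of a cluster into a new one lowers the intra-cluster sum, which is the intuition behind growing the number of clusters until no further improvement is worthwhile.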

References

Adams W.P. and Sherali H.D. (1990). Linearization strategies for a class of zero-one mixed integer programming problems. Operat. Res. 38(2): 217–226
Aggarwal A. and Floudas C.A. (1990). Synthesis of general separation sequences - nonsharp separations. Comput. Chem. Eng. 14: 631–653
Beer M. and Tavazoie S. (2004). Predicting gene expression from sequence. Cell 117: 185–198
Bezdek J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York
Brooke A., Kendrick D. and Meeraus A. (1988). GAMS: A User’s Guide. The Scientific Press, San Francisco, CA
Carpenter G. and Grossberg S. (1990). ART3: hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3: 129–152
Ciric A.R. and Floudas C.A. (1989). A retrofit approach of heat exchanger networks. Comput. Chem. Eng. 13: 703–715
Claverie J. (1999). Computational methods for the identification of differential and coordinated gene expression. Human Mol. Genet. 8: 1821–1832
Davies D.L. and Bouldin D.W. (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. 1(4): 224–227
Dempster A.P., Laird N.M. and Rubin D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B. 39(1): 1–38
DeRisi J.L., Iyer V.R. and Brown P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686
Dhillon I.S. and Guan Y. (2003). Information theoretic clustering of sparse co-occurrence data. Proceedings of the Third IEEE International Conference on Data Mining (ICDM)
Dunn J.C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybernet. 3: 32–57
Dunn J.C. (1974). Well separated clusters and optimal fuzzy partitions. J. Cybernet. 4: 95–104
Duran M.A. and Odell P.L. (1974). Cluster Analysis: A Survey. Springer Verlag, New York
Eisen M.B., Spellman P.T., Brown P.O. and Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. U.S.A. 95(25): 14863–14868
Floudas C.A., Akrotirianakis I.G., Caratzoulas S., Meyer C.A. and Kallrath J. (2005). Global optimization in the 21st century: advances and challenges. Comput. Chem. Eng. 29: 1185–1202
Floudas C.A. (2000). Deterministic Global Optimization: Theory, Algorithms, and Applications. Kluwer Academic Publishers
Floudas C.A. (1995). Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford University Press
Floudas C.A., Aggarwal A. and Ciric A.R. (1989). Global optimum search for nonconvex NLP and MINLP problems. Comput. Chem. Eng. 13(10): 1117–1132
Floudas C.A. and Anastasiadis S.H. (1988). Synthesis of general distillation sequences with several multicomponent feeds and products. Chem. Eng. Sci. 43: 2407–2419
Floudas C.A. and Grossmann I.E. (1987). Synthesis of flexible heat exchanger networks with uncertain flow rates and temperatures. Comput. Chem. Eng. 11: 319–336
Geoffrion A.M. (1973). Generalized Benders decomposition. J. Optim. Theory Appl. 10(4): 237
Goodman L. and Kruskal W. (1954). Measures of association for cross classifications. J. Am. Stat. Assoc. 49: 732–764
Gower J.C. and Ross G.J.S. (1969). Minimum spanning trees and single-linkage cluster analysis. Appl. Stat. 18: 54–64
Halkidi M., Batistakis Y. and Vazirgiannis M. (2002). Cluster validity methods: Part 1. SIGMOD Record 31(2): 40–45
Hansen P. and Jaumard B. (1997). Cluster analysis and mathematical programming. Math. Program. 79: 191–215
Hartigan J.A. (1975). Clustering Algorithms. John Wiley & Sons, New York
Hartigan J.A. and Wong M.A. (1979). Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. J. Roy. St. C. 28: 100–108
Herrero J., Valencia A. and Dopazo J. (2001). A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2): 126–136
Heyer L.J., Kruglyak S. and Yooseph S. (1999). Exploring expression data: identification and analysis of co-expressed genes. Genome Res. 9: 1106–1115
Hubert L. and Schultz J. (1976). Quadratic assignment as a general data-analysis strategy. Br. J. Math. Stat. Psychol. 29: 190–241
Jaccard P. (1912). The distribution of flora in the alpine zone. New Phytol. 11: 37–50
Jain A.K., Murty M.N. and Flynn P.J. (1999). Data clustering: a review. ACM Comput. Surv. 31(3): 264–323
Jain A.K. and Dubes R.C. (1988). Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series, Prentice-Hall, Inc., Englewood Cliffs, New Jersey
Johnson R.E. (2001). The role of cluster analysis in assessing comparability under the US transfer pricing regulations. Business Economics (April 2001)
Jung Y., Park H., Du D. and Drake B.L. (2003). A decision criterion for the optimal number of clusters in hierarchical clustering. J. Global Optimiz. 25: 91–111
Kirkpatrick S., Gelatt C.D. and Vecchi M.P. (1983). Optimization by simulated annealing. Science 220(4598): 671–680
Kohonen T. (1984). Self Organization and Associative Memory. Springer Information Science Series, Springer Verlag, Berlin, Heidelberg, New York
Kohonen T. (1997). Self-Organizing Maps. Springer Verlag, Berlin
Kokossis A.C. and Floudas C.A. (1994). Optimization of complex reactor networks - II. Nonisothermal operation. Chem. Eng. Sci. 49: 1037–1051
Leisch F., Weingessel A. and Dimitriadou E. (1998). Competitive learning for binary valued data. In: Niklasson L., Bodén M. and Ziemke T. (eds.) Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), vol. 2, pp. 779–784. Skövde, Sweden, Springer
Likas A., Vlassis N. and Verbeek J.J. (2003). The global K-means clustering algorithm. Pattern Recogn. 36: 451–461
Lin X., Floudas C., Wang Y. and Broach J.R. (2003). Theoretical and computational studies of the glucose signaling pathways in yeast using global gene expression data. Biotechnol. Bioeng. 84(7): 864–886
Lukashin A.V. and Fuchs R. (2001). Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 17(5): 405–414
MacQueen J. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297
Metropolis N., Rosenbluth A., Rosenbluth M., Teller A. and Teller E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1091
Paules G.E. IV. and Floudas C.A. (1989). APROS: Algorithmic development methodology for discrete-continuous optimization problems. Oper. Res. J. 37: 902–915
Pauwels E.J. and Frederix G. (1999). Finding salient regions in images: non-parametric clustering for image segmentation and grouping. Comput. Vision Image Understand. 75: 73–85
Pipenbacher P., Schliep A., Schneckener S., Schonhuth A., Schomburg D. and Schrader R. (2002). ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 18(Suppl 2): S182–S191
Rand W.M. (1971). Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336): 846–850
Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comp. App. Math 20: 53–65
Ruspini E.H. (1969). A new approach to clustering. Inf. Control 15: 22–32
Schneper L., Düvel K. and Broach J.R. (2004). Sense and sensibility: nutritional response and signal integration in yeast. Curr. Opin. Microbiol. 7(6): 624–630
Sherali H.D. and Desai J. (2005a). A global optimization RLT-based approach for solving the hard clustering problem. J. Global Optimiz. 32(2): 281–306
Sherali H.D. and Desai J. (2005b). A global optimization RLT-based approach for solving the fuzzy clustering problem. J. Global Optimiz. 33(4): 597–615
Slonim N., Atwal G.S., Tkačik G. and Bialek W. (2005). Information based clustering. Proc. Nat. Acad. Sci. U.S.A. 102(51): 18297–18302
Sokal R.R. and Michener C.D. (1958). A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38: 1409–1438
Sorlie T., Tibshirani R., Parker J., Hastie T., Marron J.S., Nobel A., Deng S., Johnsen H., Pesich R., Geisler S., Demeter J., Perou C.M., Lonning P.E., Brown P.O., Borresen-Dale A.L. and Botstein D. (2003). Repeated observations of breast tumor subtypes in independent gene expression data sets. Proc. Nat. Acad. Sci. U.S.A. 100: 8418–8423
Tishby N., Pereira F. and Bialek W. (1999). The information bottleneck method. In: Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377
Troyanskaya O.G., Dolinski K., Owen A.B., Altman R.B. and Botstein D. (2003). A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Nat. Acad. Sci. U.S.A. 100: 8348–8353
Wang Y., Pierce M., Schneper L., Guldal C.G., Zhang X., Tavazoie S. and Broach J.R. (2004). Ras and Gpa2 mediate one branch of a redundant glucose signaling pathway in yeast. PLoS Biol. 2(5): 610–622
Weiler J., Gausepohl H., Hauser N., Jensen O.N. and Hoheisel J.D. (1997). Hybridization-based DNA screening on peptide nucleic acid (PNA) oligomer arrays. Nucleic Acids Res. 25: 2792–2799
Wu Z. and Leahy R. (1993). An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 15(11): 1101–1113
Xu R. and Wunsch D. (2005). Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3): 645–678
Zahn C.T. (1971). Graph theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20: 68–86
Zhang B., Hsu M. and Dayal U. (1999). K-Harmonic Means – A Data Clustering Algorithm. Hewlett-Packard Research Laboratory Technical Report (June 1999)
Zhang B. (2000). Generalized K-Harmonic Means: Boosting in Unsupervised Learning. Hewlett-Packard Research Laboratory Technical Report (October 2000)

Author information

Authors and Affiliations

Department of Chemical Engineering, Princeton University, Princeton, NJ, 08544, USA
Meng Piao Tan & Christodoulos A. Floudas
Department of Molecular Biology, Princeton University, Princeton, NJ, 08544, USA
James R. Broach

Corresponding author

Correspondence to Christodoulos A. Floudas.

About this article

Cite this article

Tan, M.P., Broach, J.R. & Floudas, C.A. A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced positioning. J Glob Optim 39, 323–346 (2007). https://doi.org/10.1007/s10898-007-9140-6

Received: 10 July 2006
Accepted: 15 January 2007
Published: 12 April 2007
Issue Date: November 2007
DOI: https://doi.org/10.1007/s10898-007-9140-6
