Accelerated EM-based clustering of large data sets

Data Mining and Knowledge Discovery

Abstract

Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
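
To make the bound-maximization loop above concrete, the following is a minimal Python sketch of one accelerated EM step on grouped data, assuming the data set has already been partitioned into groups (for instance via the nodes of a kd-tree) and each group's size, sample mean, and sample covariance have been cached. All function and variable names are illustrative; the sketch follows the general grouped-EM idea described in the abstract rather than the paper's exact algorithm.

    import numpy as np

    def grouped_em_step(counts, group_means, group_covs, pi, mu, cov):
        """One accelerated EM iteration on grouped data (illustrative sketch).

        counts[a]      -- number of data points in group a (NumPy array, shape (n_groups,))
        group_means[a] -- sample mean of group a, shape (d,)
        group_covs[a]  -- sample covariance of group a, shape (d, d)
        pi, mu, cov    -- current mixture weights, means, and covariances
        """
        n_groups, d = group_means.shape
        k = len(pi)

        # E-step at the group level: the average log-density of component s over
        # a group can be computed from the cached group statistics alone, since
        # it equals the log-density at the group mean minus 0.5 * tr(C_s^{-1} S_a).
        log_r = np.empty((n_groups, k))
        for s in range(k):
            inv = np.linalg.inv(cov[s])
            _, logdet = np.linalg.slogdet(cov[s])
            diff = group_means - mu[s]                            # (n_groups, d)
            maha = np.einsum('ij,jk,ik->i', diff, inv, diff)      # squared Mahalanobis distances
            trace = np.einsum('jk,ikj->i', inv, group_covs)       # tr(C_s^{-1} S_a)
            log_r[:, s] = np.log(pi[s]) - 0.5 * (d * np.log(2 * np.pi)
                                                 + logdet + maha + trace)
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)                   # one responsibility vector per group

        # M-step: re-estimate the parameters from group-averaged sufficient
        # statistics, weighting each group by its size.
        w = resp * counts[:, None]                                # n_a * q_a(s)
        Nk = w.sum(axis=0)
        pi_new = Nk / counts.sum()
        mu_new = (w.T @ group_means) / Nk[:, None]
        cov_new = np.empty_like(cov)
        for s in range(k):
            diff = group_means - mu_new[s]
            cov_new[s] = (np.einsum('i,ijk->jk', w[:, s], group_covs)
                          + (w[:, s, None] * diff).T @ diff) / Nk[s]
        return pi_new, mu_new, cov_new

Because every group shares a single responsibility vector, the cost of one step grows with the number of groups rather than the number of data points; refining the grouping tightens the bound, and in the limit of one point per group the step reduces to a standard EM iteration.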


Notes

  1. This is why we use the more general term ‘responsibility’ for the distributions q rather than e.g. ‘cluster posterior probability’.

  2. As we discuss in Section 6, our approach straightforwardly generalizes to the case of overlapping cells.

  3. The inverse covariance matrix is found in linear time since it is diagonal.

  4. We only require that each data point is contained in at least one subset.

  5. In principle it is also possible, and straightforward, to optimize over the association variables \(\beta_{iA}\).

  6. Following Dasgupta (1999), a Gaussian mixture is c-separated if for each pair (i, j) of component densities \(\|m_i-m_j\| \geq c\sqrt{d\max \{\lambda_{\max}({C}_i),\lambda_{\max}({C}_j)\}}\), where \(\lambda_{\max}({C})\) denotes the maximum eigenvalue of C; a short sketch computing this separation value follows these notes.

  7. Recall that both algorithms have a running time that is linear in the number of nodes processed in each step.
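
For reference, the separation measure in note 6 can be computed directly from the mixture parameters. The helper below is an illustrative sketch (not part of the paper) that returns the largest c for which a given Gaussian mixture is c-separated.

    import numpy as np

    def separation(means, covs):
        """Largest c such that the mixture (means[i], covs[i]) is c-separated
        in the sense of Dasgupta (1999); illustrative helper, not from the paper."""
        k, d = means.shape
        lam = np.array([np.linalg.eigvalsh(C).max() for C in covs])  # lambda_max(C_i)
        c = np.inf
        for i in range(k):
            for j in range(i + 1, k):
                dist = np.linalg.norm(means[i] - means[j])
                c = min(c, dist / np.sqrt(d * max(lam[i], lam[j])))
        return c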

References

  • Bentley JL (1975) Multidimensional binary search trees used for associative searching. Comm ACM 18(9):509–517

  • Bishop CM, Svensén M, Williams CKI (1998) GTM: The generative topographic mapping. Neur Comput 10:215–234

  • Bradley PS, Fayyad UM, Reina CA (1998) Scaling EM (expectation maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research

  • Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of the IEEE Symposium on Foundations of Computer Science, vol. 40. IEEE Computer Society Press, Los Alamitos, CA, USA, pp 634–644

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc Ser B (Methodological) 39(1):1–38

  • Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer Academic Publishers, Boston

  • Kanungo T, Mount DM, Netanyahu N, Piatko C, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans Patt Anal Mach Intell 24:881–892

  • Li JQ, Barron AR (2000) Mixture density estimation. In: Solla SA, Leen TK, Müller K-R (eds) Advances in neural information processing systems, vol. 12. MIT Press, Cambridge, MA, USA, pp 279–285

  • Lindsay BG (1983) The geometry of mixture likelihoods: A general theory. Ann Stat 11(1):86–94

  • McCallum A, Nigam K, Ungar L (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In: Ramakrishnan R, Stolfo S (eds) Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, vol. 6. ACM Press, New York, NY, USA

  • McLachlan GJ, Peel D (2000) Finite mixture models. John Wiley & Sons

  • Moore A (1999) Very fast EM-based mixture model clustering using multiresolution kd-trees. In: Kearns MJ, Solla SA, Cohn DA (eds) Advances in Neural information processing systems, vol. 11. MIT Press, Cambridge, MA, USA, pp 543–549

  • Moore A, Pelleg D (1999) Accelerating exact k-means algorithms with geometric reasoning. In: Proc 5th Int Conf Knowledge Discovery and Data Mining, pp 277–281

  • Moore AW (2000) The anchors hierarchy: Using the triangle inequality to survive high-dimensional data. In: Boutilier C, Goldszmidt M (eds) Proceedings of the Annual conference on uncertainty in artificial intelligence, vol. 16. Morgan Kaufmann, San Mateo, CA, USA, pp 397–405

  • Moore AW, Lee MS (1998) Cached sufficient statistics for efficient machine learning with large data sets. J Artif Intell Res 8:67–91

  • Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (eds) Learning in graphical models. Kluwer, Boston, MA, USA, pp 355–368

  • Nunnink JRJ (2003) Large scale Gaussian mixture modelling using a greedy expectation-maximisation algorithm. Master's thesis, Informatics Institute, University of Amsterdam. www.science.uva.nl/research/ias/alumni/m.sc.theses

  • Omohundro SM (1989) Five balltree construction algorithms. Technical Report TR-89-063, International Computer Science Institute, Berkeley

  • Rose K (1998) Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc IEEE 86(11):2210–2239

  • Sand P, Moore AW (2001) Repairing faulty mixture models using density estimation. In: Brodley CE, Danyluk AP (eds) Proceedings of the international conference on machine learning, vol. 18. Morgan Kaufmann, San Mateo, CA, USA, pp 457–464

  • Sproull RF (1991) Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6:579–589

  • Thiesson B, Meek C, Heckerman D (2001) Accelerating EM for large databases. Mach Learn 45(3):279–299

  • Titsias M, Likas A (2001) Shared kernel models for class conditional density estimation. IEEE Trans Neur Netw 12(5):987–997

  • Verbeek JJ, Vlassis N, Kröse BJA (2003) Efficient greedy learning of Gaussian mixture models. Neur Comput 15(2):469–485

  • Vlassis N, Likas A (2002) A greedy EM algorithm for Gaussian mixture learning. Neur Proc Lett 15(1):77–87

  • Zhang T (2002) A general greedy approximation algorithm with applications. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, vol. 14. MIT Press, Cambridge, MA, USA

Acknowledgments

We would like to thank the reviewers for their useful comments which helped to improve this manuscript. We are indebted to Tijn Schmits for part of the experimental work. JJV is supported by the Technology Foundation STW (project AIF 4997) applied science division of NWO and the technology program of the Dutch Ministry of Economic Affairs.

Author information

Correspondence to Jakob J. Verbeek.

About this article

Cite this article

Verbeek, J.J., Nunnink, J.R.J. & Vlassis, N. Accelerated EM-based clustering of large data sets. Data Min Knowl Disc 13, 291–307 (2006). https://doi.org/10.1007/s10618-005-0033-3
