Accelerated EM-based clustering of large data sets

Verbeek, Jakob J.; Nunnink, Jan R. J.; Vlassis, Nikos

doi:10.1007/s10618-005-0033-3

Published: 26 May 2006

Volume 13, pages 291–307, (2006)

Data Mining and Knowledge Discovery

Jakob J. Verbeek¹,
Jan R. J. Nunnink² &
Nikos Vlassis²

Abstract

Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
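The alternation described in the abstract can be sketched for full-covariance Gaussian mixtures as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cell_statistics` and `grouped_em_step`, the fixed unit-grid partition, and the initial parameters are all hypothetical, and the paper itself builds its grouping differently. Each cell A is summarized by its size |A|, its mean ⟨x⟩_A and its second moment ⟨xxᵀ⟩_A; the E-step assigns one responsibility vector per cell, and the M-step re-estimates the parameters from the cell statistics alone.

```python
import numpy as np


def cell_statistics(X, labels, n_cells):
    """Counts |A|, means <x>_A and second moments <x x^T>_A per cell.

    Assumes every cell index in range(n_cells) occurs at least once.
    """
    d = X.shape[1]
    counts = np.bincount(labels, minlength=n_cells).astype(float)
    mean = np.zeros((n_cells, d))
    second = np.zeros((n_cells, d, d))
    np.add.at(mean, labels, X)
    np.add.at(second, labels, X[:, :, None] * X[:, None, :])
    mean /= counts[:, None]
    second /= counts[:, None, None]
    return counts, mean, second


def grouped_em_step(counts, mean, second, pi, mu, cov):
    """One E-step and one M-step that only touch cell statistics.

    Returns the updated parameters and the value of the lower bound
    evaluated at the input parameters with optimal responsibilities.
    """
    n_cells, d = mean.shape
    k = len(pi)
    log_r = np.empty((n_cells, k))
    for s in range(k):
        prec = np.linalg.inv(cov[s])
        _, logdet = np.linalg.slogdet(cov[s])
        # <(x-mu)^T P (x-mu)>_A = tr(P <xx^T>_A) - 2 mu^T P <x>_A + mu^T P mu
        quad = (np.einsum('ij,aij->a', prec, second)
                - 2.0 * (mean @ prec @ mu[s])
                + mu[s] @ prec @ mu[s])
        log_r[:, s] = np.log(pi[s]) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    # E-step: q_A(s) is proportional to pi_s * exp(<log p(x|s)>_A)
    log_norm = np.logaddexp.reduce(log_r, axis=1)
    q = np.exp(log_r - log_norm[:, None])
    # With optimal q, the bound is the size-weighted sum of log normalizers
    bound = float(counts @ log_norm)
    # M-step: weight each cell's statistics by |A| * q_A(s)
    w = counts[:, None] * q
    nk = w.sum(axis=0)
    pi_new = nk / counts.sum()
    mu_new = (w.T @ mean) / nk[:, None]
    cov_new = np.einsum('as,aij->sij', w, second) / nk[:, None, None]
    cov_new -= mu_new[:, :, None] * mu_new[:, None, :]
    return pi_new, mu_new, cov_new, bound


# Usage: two synthetic blobs, grouped into unit grid cells (an assumed partition).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))])
grid = np.floor(X).astype(int)
_, labels = np.unique(grid[:, 0] * 1000 + grid[:, 1], return_inverse=True)
counts, mean, second = cell_statistics(X, labels, labels.max() + 1)

pi = np.array([0.5, 0.5])
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])
cov = np.stack([np.eye(2), np.eye(2)])
bounds = []
for _ in range(10):
    pi, mu, cov, b = grouped_em_step(counts, mean, second, pi, mu, cov)
    bounds.append(b)
```

In this sketch the cost of a step depends on the number of cells rather than the number of data points. When every cell holds a single point, the step reduces to ordinary EM and the bound equals the data log-likelihood; coarser cells loosen the bound but make each step cheaper.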

Notes

This is why we use the more general term ‘responsibility’ for the distributions q rather than e.g. ‘cluster posterior probability’.
As we discuss in Section 6, our approach straightforwardly generalizes to the case of overlapping cells.
The inverse covariance matrix is found in linear time since it is diagonal.
We only require that each data point is contained in at least one subset.
In principle it is also possible, and straightforward, to optimize over the association variables \(\beta_{iA}\).
Following Dasgupta (1999), a Gaussian mixture is c separated if for each pair (i, j) of component densities \(\|m_i-m_j\| \geq c\sqrt{d\max \{\lambda_{\max}({C}_i),\lambda_{\max}({C}_j)\}}\), where \(\lambda_{\max}({C})\) denotes the maximum eigenvalue of C.
Recall that both algorithms have a running time that is linear in the number of nodes processed in each step.
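The c-separation criterion quoted from Dasgupta (1999) in the notes above can be checked numerically. The sketch below is only an illustration; the function name `separation` is hypothetical and does not come from the article. It returns the largest c for which every pair of components satisfies the inequality, so a mixture is c-separated exactly when the returned value is at least c.

```python
import itertools

import numpy as np


def separation(means, covs):
    """Largest c such that the mixture is c-separated (Dasgupta's criterion)."""
    d = means.shape[1]
    c = np.inf
    for i, j in itertools.combinations(range(len(means)), 2):
        # eigvalsh returns eigenvalues in ascending order, so [-1] is the largest
        lam = max(np.linalg.eigvalsh(covs[i])[-1], np.linalg.eigvalsh(covs[j])[-1])
        dist = np.linalg.norm(means[i] - means[j])
        c = min(c, dist / np.sqrt(d * lam))
    return c


# Usage: two unit-covariance components in 2-D whose means are 4 apart.
means = np.array([[0.0, 0.0], [4.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
c_value = separation(means, covs)
```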

References

Bentley JL (1975) Multidimensional binary search trees used for associative searching. Comm ACM 18(9):509–517
Bishop CM, Svensén M, Williams CKI (1998) GTM: The generative topographic mapping. Neur Comput 10:215–234
Bradley PS, Fayyad UM, Reina CA (1998) Scaling EM (expectation maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research
Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of the IEEE Symposium on Foundations of Computer Science, vol. 40. IEEE Computer Society Press, Los Alamitos, CA, USA, pp 634–644
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc Ser B (Methodological) 39(1):1–38
Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer Academic Publishers, Boston
Kanungo T, Mount DM, Netanyahu N, Piatko C, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: Analysis and implementation. Trans Patt Anal Mach Intell 24:881–892
Li JQ, Barron AR (2000) Mixture density estimation. In: Solla SA, Leen TK, Müller K-R (eds) Advances in neural information processing systems, vol. 12. MIT Press, Cambridge, MA, USA, pp 279–285
Lindsay BG (1983) The geometry of mixture likelihoods: A general theory. Ann Stat 11(1):86–94
McCallum A, Nigam K, Ungar L (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In: Ramakrishnan R, Stolfo S (eds) Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, vol. 6. ACM Press, New York, NY, USA
McLachlan GJ, Peel D (2000) Finite mixture models. John Wiley & Sons
Moore A (1999) Very fast EM-based mixture model clustering using multiresolution kd-trees. In: Kearns MJ, Solla SA, Cohn DA (eds) Advances in Neural information processing systems, vol. 11. MIT Press, Cambridge, MA, USA, pp 543–549
Moore A, Pelleg D (1999) Accelerating exact k-means algorithms with geometric reasoning. In: Proc 5th Int Conf Knowledge Discovery and Data Mining, pp 277–281
Moore AW (2000) The anchors hierarchy: Using the triangle inequality to survive high-dimensional data. In: Boutilier C, Goldszmidt M (eds) Proceedings of the Annual conference on uncertainty in artificial intelligence, vol. 16. Morgan Kaufmann, San Mateo, CA, USA, pp 397–405
Moore AW, Lee MS (1998) Cached sufficient statistics for efficient machine learning with large data sets. J Arti Intell Res 8:67–91
Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. Kluwer, Boston, MA, USA, pp 355–368
Nunnink JRJ (2003) Large scale Gaussian mixture modelling using a greedy expectation-maximisation algorithm. Master's thesis, Informatics Institute, University of Amsterdam. www.science.uva.nl/research/ias/alumni/m.sc.theses
Omohundro SM (1989) Five balltree construction algorithms. Technical Report TR-89-063, International Computer Science Institute, Berkeley
Rose K (1998) Deterministic annealing for clustering, compression, classification, regression and related optimization problems. IEEE Trans Inform The 86(11):2210–2239
Sand P, Moore AW (2001) Repairing faulty mixture models using density estimation. In: Brodley CE, Danyluk AP (eds) Proceedings of the international conference on machine learning, vol. 18. Morgan Kaufmann, San Mateo, CA, USA, pp 457–464
Sproull RF (1991) Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6:579–589
Thiesson B, Meek C, Heckerman D (2001) Accelerating EM for large databases. Mach Learn 45(3):279–299
Titsias M, Likas A (2001) Shared kernel models for class conditional density estimation. IEEE Trans Neur Netw 12(5):987–997
Verbeek JJ, Vlassis N, Kröse BJA (2003) Efficient greedy learning of Gaussian mixture models. Neur Comput 15(2):469–485
Vlassis N, Likas A (2002) A greedy EM algorithm for Gaussian mixture learning. Neur Proc Lett 15(1):77–87
Zhang T (2002) A general greedy approximation algorithm with applications. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, vol. 14. MIT Press, Cambridge, MA, USA

Acknowledgments

We would like to thank the reviewers for their useful comments, which helped to improve this manuscript. We are indebted to Tijn Schmits for part of the experimental work. JJV is supported by the Technology Foundation STW (project AIF 4997), applied science division of NWO, and the technology program of the Dutch Ministry of Economic Affairs.

Author information

Authors and Affiliations

INRIA Rhone-Alpes, 655 Avenue de l'Europe, 38330, Montbonnot Saint-Martin, France
Jakob J. Verbeek
Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Jan R. J. Nunnink & Nikos Vlassis

Corresponding author

Correspondence to Jakob J. Verbeek.

About this article

Cite this article

Verbeek, J.J., Nunnink, J.R.J. & Vlassis, N. Accelerated EM-based clustering of large data sets. Data Min Knowl Disc 13, 291–307 (2006). https://doi.org/10.1007/s10618-005-0033-3

Received: 29 July 2004
Accepted: 14 November 2005
Published: 26 May 2006
Issue Date: November 2006
DOI: https://doi.org/10.1007/s10618-005-0033-3
