Abstract
We describe a novel algorithm called \(k\)-Maximum Likelihood Estimator (\(k\)-MLE) for learning finite statistical mixtures of exponential families, relying on Hartigan’s \(k\)-means swap clustering method. To illustrate this versatile Hartigan \(k\)-MLE technique, we consider the exponential family of Wishart distributions and show how to learn their mixtures. First, given a set of symmetric positive definite observation matrices, we provide an iterative algorithm to estimate the parameters of the underlying Wishart distribution that is guaranteed to converge to the MLE. Second, two initialization methods for \(k\)-MLE are proposed and compared. Finally, we propose to use the Cauchy-Schwarz statistical divergence as a dissimilarity measure between two Wishart mixture models and sketch a general methodology for building a motion retrieval system.
Notes
1. Otherwise, convergence to a pointwise estimate of the parameters would be replaced by convergence in distribution of a Markov chain.
2. The product \(\hat{\theta }_n^{(t)}\hat{\theta }_S^{(t)}\) remains constant across iterations.
3. For translation invariance, the \(\mathbb {X}_{i}\) are column-centered beforehand.
4. Since \(|2S|=2^d|S|\), the factor \(2^{\frac{nd}{2}} |S|^{\frac{n}{2}}\) can be rewritten as \(|2S|^{\frac{n}{2}}\).
References
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley Series in Probability and Statistics. Wiley-Interscience, New York (2008)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
Nielsen, F.: \(k\)-MLE: a fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 869–872 (2012). Long version as arXiv:1203.5181
Jain, A.K.: Data clustering: 50 years beyond \(K\)-means. Pattern Recogn. Lett. 31, 651–666 (2010)
Wishart, J.: The generalised product moment distribution in samples from a Normal multivariate population. Biometrika 20(1/2), 32–52 (1928)
Tsai, M.-T.: Maximum likelihood estimation of Wishart mean matrices under Löwner order restrictions. J. Multivar. Anal. 98(5), 932–944 (2007)
Formont, P., Pascal, F., Vasile, G., Ovarlez, J.-P., Ferro-Famil, L.: Statistical classification for heterogeneous polarimetric SAR images. IEEE J. Sel. Top. Sign. Proces. 5(3), 567–576 (2011)
Jian, B., Vemuri, B.: Multi-fiber reconstruction from diffusion MRI using mixture of Wisharts and sparse deconvolution. In: Information Processing in Medical Imaging, pp. 384–395. Springer, Berlin (2007)
Cherian, A., Morellas, V., Papanikolopoulos, N., Bedros, S.: Dirichlet process mixture models on symmetric positive definite matrices for appearance clustering in video surveillance applications. In: Computer Vision and Pattern Recognition (CVPR), pp. 3417–3424 (2011)
Nielsen, F., Garcia, V.: Statistical exponential families: a digest with flash cards. http://arxiv.org/abs/0911.4863. Accessed Nov 2009
Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1997)
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B (Methodological) 39(1), 1–38 (1977)
Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data Anal. 14(3), 315–332 (1992)
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A \(k\)-means clustering algorithm. J. Roy. Stat. Soc. C (Applied Statistics) 28(1), 100–108 (1979)
Telgarsky, M., Vattani, A.: Hartigan’s method: \(k\)-means clustering without Voronoi. In: Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 820–827 (2010)
Nielsen, F., Boissonnat, J.D., Nock, R.: On Bregman Voronoi diagrams. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 746–755 (2007)
Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3), 645–678 (2005)
Kulis, B., Jordan, M.I.: Revisiting \(k\)-means: new algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML) (2012)
Ackermann, M.R.: Algorithms for the Bregman \(K\)-median problem. PhD thesis. Paderborn University (2009)
Arthur, D., Vassilvitskii, S.: \(k\)-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
Ji, S., Krishnapuram, B., Carin, L.: Variational Bayes for continuous hidden Markov models and its application to active learning. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 522–532 (2006)
Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: application to movement clustering. Pattern Recogn. Lett. 31(14), 2318–2324 (2010)
Brent, R.P.: Algorithms for Minimization Without Derivatives. Courier Dover Publications, Mineola (1973)
Bezdek, J.C., Hathaway, R.J., Howard, R.E., Wilson, C.A., Windham, M.P.: Local convergence analysis of a grouped variable version of coordinate descent. J. Optim. Theory Appl. 54(3), 471–477 (1987)
Bogdan, K., Bogdan, M.: On existence of maximum likelihood estimators in exponential families. Statistics 34(2), 137–149 (2000)
Ciuperca, G., Ridolfi, A., Idier, J.: Penalized maximum likelihood estimator for normal mixtures. Scand. J. Stat. 30(1), 45–59 (2003)
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)
Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR), pp. 1723–1726 (2012)
Haff, L.R., Kim, P.T., Koo, J.-Y., Richards, D.: Minimax estimation for mixtures of Wishart distributions. Ann. Stat. 39(6), 3417–3440 (2011)
Jebara, T., Kondor, R., Howard, A.: Probability product kernels. J. Mach. Learn. Res. 5, 819–844 (2004)
Moreno, P.J., Ho, P., Vasconcelos, N.: A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In: Advances in Neural Information Processing Systems (2003)
Petersen, K.B., Pedersen, M.S.: The matrix cookbook. http://www2.imm.dtu.dk/pubdb/p.php?3274. Accessed Nov 2012
Appendix A
This appendix details some calculations for the distributions \(\mathcal {W}_{d}\), \(\mathcal {W}_{d,\underline{n}}\), and \(\mathcal {W}_{d,\underline{S}}\).
11.1.1 Wishart Distribution \(\mathcal {W}_{d}\)
Letting \((\theta _{n},\theta _{S})=(\frac{n-d-1}{2},S^{-1}) \longleftrightarrow (n,S) = (2\theta _{n}+d+1,\theta _{S}^{-1})\), the log-normalizer of the family is
\[
F(\theta _{n},\theta _{S}) = \left(\theta _{n}+\frac{d+1}{2}\right)\left(d\log 2 - \log |\theta _{S}|\right) + \log \varGamma _{d}\!\left(\theta _{n}+\frac{d+1}{2}\right),
\]
with gradient
\[
\nabla F(\theta _{n},\theta _{S}) = \left(d\log 2 - \log |\theta _{S}| + \varPsi _{d}\!\left(\theta _{n}+\tfrac{d+1}{2}\right),\; -\left(\theta _{n}+\tfrac{d+1}{2}\right)\theta _{S}^{-1}\right),
\]
where \(\varPsi _{d}\) is the multivariate digamma function (or multivariate polygamma of order 0).
The dissimilarity \(\varDelta (\theta ,\theta ')\) between the natural parameters \(\theta =(\theta _{n},\theta _{S})\) and \(\theta '=(\theta '_{n},\theta '_{S})\) satisfies \(\varDelta (\theta ,\theta ) \ne 0\); the same quantity can equivalently be expressed with the source parameters \(\lambda =(n,S)\) and \(\lambda '=(n',S')\).
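Under the assumption (ours, not a definition stated above) that \(\varDelta\) denotes the log-product integral underlying the Cauchy-Schwarz divergence, a form consistent with the remark \(\varDelta (\theta ,\theta ) \ne 0\) is
\[
\varDelta (\theta ,\theta ') = F(\theta +\theta ') - F(\theta ) - F(\theta ') = \log \int p_{\theta }(X)\,p_{\theta '}(X)\,\mathrm{d}X,
\]
so that the Cauchy-Schwarz divergence would read \(\mathrm{CS}(p_{\theta },p_{\theta '}) = \frac{1}{2}\varDelta (\theta ,\theta ) + \frac{1}{2}\varDelta (\theta ',\theta ') - \varDelta (\theta ,\theta ')\).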
11.1.2 Distribution \(\mathcal {W}_{d,\underline{n}}\)
Letting \(\theta _{S}=S^{-1}\), the log-normalizer of this sub-family reduces to
\[
F_{\underline{n}}(\theta _{S}) = -\frac{\underline{n}}{2}\log |\theta _{S}| + \frac{\underline{n}d}{2}\log 2 + \log \varGamma _{d}\!\left(\frac{\underline{n}}{2}\right).
\]
Using the rule \(\frac{\partial \log |X|}{\partial X} = {}^{t}(X^{-1})\) [33] and the symmetry of \(\theta _{S}\), we get
\[
\nabla F_{\underline{n}}(\theta _{S}) = -\frac{\underline{n}}{2}\,\theta _{S}^{-1}.
\]
The correspondence between the natural parameter \(\theta _{S}\) and the expectation parameter \(\eta _{S}\) is
\[
\eta _{S} = -\frac{\underline{n}}{2}\,\theta _{S}^{-1} \longleftrightarrow \theta _{S} = -\frac{\underline{n}}{2}\,\eta _{S}^{-1}.
\]
Finally, we obtain the MLE for \(\theta _{S}\) in this sub-family:
\[
\hat{\theta }_{S} = -\frac{\underline{n}}{2}\,\hat{\eta }_{S}^{-1} = \underline{n}\left(\frac{1}{N}\sum _{i=1}^{N}X_{i}\right)^{-1}, \qquad \hat{\eta }_{S} = -\frac{1}{2N}\sum _{i=1}^{N}X_{i}.
\]
The same formulation with the source parameter \(S\):
\[
\hat{S} = \hat{\theta }_{S}^{-1} = \frac{1}{\underline{n}N}\sum _{i=1}^{N}X_{i}.
\]
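As a quick numerical sanity check of this closed-form estimator, here is a minimal sketch assuming SciPy's `scipy.stats.wishart` sampler; the values of `d`, `n_fixed`, `N`, and `S_true` are illustrative choices, not taken from the chapter.

```python
import numpy as np
from scipy.stats import wishart

# Illustrative ground truth (assumed values, not from the chapter).
d, n_fixed, N = 3, 10, 5000
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
S_true = A @ A.T + d * np.eye(d)  # a random SPD scale matrix

# Draw N observation matrices X_i ~ W_d(n_fixed, S_true).
X = wishart(df=n_fixed, scale=S_true).rvs(size=N, random_state=rng)

# Closed-form MLE for fixed degrees of freedom: S_hat = mean(X_i) / n.
S_hat = X.mean(axis=0) / n_fixed
print(np.allclose(S_hat, S_true, rtol=0.1, atol=0.1))  # True for large N
```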
The dual log-normalizer \(F_{\underline{n}}^{*}\) for \(\mathcal {W}_{d,\underline{n}}\) is
\[
F_{\underline{n}}^{*}(\eta _{S}) = \langle \theta _{S}(\eta _{S}),\eta _{S}\rangle - F_{\underline{n}}(\theta _{S}(\eta _{S})) = -\frac{\underline{n}}{2}\log |-\eta _{S}| + \frac{\underline{n}d}{2}\left(\log \frac{\underline{n}}{4}-1\right) - \log \varGamma _{d}\!\left(\frac{\underline{n}}{2}\right),
\]
also with the source parameter
\[
F_{\underline{n}}^{*}(\eta _{S}(S)) = -\frac{\underline{n}}{2}\log |S| - \frac{\underline{n}d}{2}\left(1+\log 2\right) - \log \varGamma _{d}\!\left(\frac{\underline{n}}{2}\right).
\]
Remark that the KL divergence now depends on \(\underline{n}\).
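To make this remark concrete, expanding the Bregman divergence of \(F_{\underline{n}}\) gives, as a sketch under the parametrization above,
\[
\mathrm{KL}\!\left(\mathcal {W}_{d,\underline{n}}(S_{1})\,\|\,\mathcal {W}_{d,\underline{n}}(S_{2})\right) = \frac{\underline{n}}{2}\left(\log \frac{|S_{2}|}{|S_{1}|} + \mathrm{tr}\!\left(S_{2}^{-1}S_{1}\right) - d\right),
\]
which scales linearly with \(\underline{n}\).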
11.1.3 Distribution \(\mathcal {W}_{d,\underline{S}}\)
For fixed \(\underline{S}\), the p.d.f. of \(\mathcal {W}_{d,\underline{S}}\) can be rewritten (see Note 4) as
\[
p(X;n) = \exp \left(\frac{n-d-1}{2}\log |X| - \frac{1}{2}\mathrm{tr}(\underline{S}^{-1}X) - \frac{n}{2}\log |2\underline{S}| - \log \varGamma _{d}\!\left(\frac{n}{2}\right)\right).
\]
Letting \(\theta _{n}=\frac{n-d-1}{2}\) (\(n=2\theta _{n}+d+1\)), the log-normalizer of this sub-family is
\[
F_{\underline{S}}(\theta _{n}) = \left(\theta _{n}+\frac{d+1}{2}\right)\log |2\underline{S}| + \log \varGamma _{d}\!\left(\theta _{n}+\frac{d+1}{2}\right).
\]
The correspondence between the natural parameter \(\theta _{n}\) and the expectation parameter \(\eta _{n}\) is
\[
\eta _{n} = \log |2\underline{S}| + \varPsi _{d}\!\left(\theta _{n}+\tfrac{d+1}{2}\right) \longleftrightarrow \theta _{n} = \varPsi _{d}^{-1}\!\left(\eta _{n} - \log |2\underline{S}|\right) - \frac{d+1}{2}.
\]
Finally, we obtain the MLE for \(\theta _{n}\) in this sub-family:
\[
\hat{\theta }_{n} = \varPsi _{d}^{-1}\!\left(\frac{1}{N}\sum _{i=1}^{N}\log |X_{i}| - \log |2\underline{S}|\right) - \frac{d+1}{2}.
\]
The same formulation with the source parameter \(n\):
\[
\hat{n} = 2\,\varPsi _{d}^{-1}\!\left(\frac{1}{N}\sum _{i=1}^{N}\log |X_{i}| - \log |2\underline{S}|\right).
\]
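Since \(\varPsi _{d}^{-1}\) has no closed form, the estimate must be computed numerically, e.g. with Brent's root finder [24]. Below is a minimal Python sketch; the function names and the bracketing interval \([d, 10^{6}]\) are illustrative choices of ours, not the chapter's.

```python
import numpy as np
from scipy.special import psi
from scipy.optimize import brentq

def multivariate_digamma(a: float, d: int) -> float:
    """Psi_d(a) = sum_{i=1}^{d} psi(a + (1 - i) / 2)."""
    return sum(psi(a + (1.0 - i) / 2.0) for i in range(1, d + 1))

def mle_dof_fixed_scale(X: np.ndarray, S_bar: np.ndarray) -> float:
    """MLE of n for fixed S_bar: solves Psi_d(n/2) = mean_i log|X_i| - log|2 S_bar|.

    X has shape (N, d, d); the bracket [d, 1e6] is a heuristic choice."""
    N, d, _ = X.shape
    target = np.mean([np.linalg.slogdet(Xi)[1] for Xi in X])
    target -= np.linalg.slogdet(2.0 * S_bar)[1]
    return brentq(lambda n: multivariate_digamma(n / 2.0, d) - target, d, 1e6)
```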
The dual log-normalizer \(F_{\underline{S}}^{*}\) for \(\mathcal {W}_{d,\underline{S}}\) is
\[
F_{\underline{S}}^{*}(\eta _{n}) = \theta _{n}(\eta _{n})\,\eta _{n} - F_{\underline{S}}(\theta _{n}(\eta _{n})), \qquad \theta _{n}(\eta _{n}) = \varPsi _{d}^{-1}\!\left(\eta _{n} - \log |2\underline{S}|\right) - \frac{d+1}{2},
\]
also with the source parameter \(n\):
\[
F_{\underline{S}}^{*}(\eta _{n}(n)) = \frac{n-d-1}{2}\,\varPsi _{d}\!\left(\frac{n}{2}\right) - \frac{d+1}{2}\log |2\underline{S}| - \log \varGamma _{d}\!\left(\frac{n}{2}\right).
\]
Remark that the resulting KL divergence (the Bregman divergence of \(F_{\underline{S}}\), in which the \(\log |2\underline{S}|\) terms cancel) does not depend on \(\underline{S}\).
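As a sketch, expanding the Bregman divergence of \(F_{\underline{S}}\) makes this cancellation explicit:
\[
\mathrm{KL}\!\left(\mathcal {W}_{d,\underline{S}}(n_{1})\,\|\,\mathcal {W}_{d,\underline{S}}(n_{2})\right) = \log \frac{\varGamma _{d}(n_{2}/2)}{\varGamma _{d}(n_{1}/2)} - \frac{n_{2}-n_{1}}{2}\,\varPsi _{d}\!\left(\frac{n_{1}}{2}\right),
\]
which indeed does not involve \(\underline{S}\).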
© 2014 Springer International Publishing Switzerland
Cite this chapter
Saint-Jean, C., Nielsen, F. (2014). Hartigan’s Method for \(k\)-MLE: Mixture Modeling with Wishart Distributions and Its Application to Motion Retrieval. In: Nielsen, F. (ed.) Geometric Theory of Information. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-05317-2_11