1 Introduction

Mixture models [12, 25, 27] are very useful tools, widely applied in pattern recognition for modeling complex probability distributions. A finite mixture model \(p({\mathbf{x}}|\Theta )\) can be expressed by a weighted sum of \(K\) components:

$$\begin{aligned} p({\mathbf{x}} |\Theta )=\sum _{m=1}^{K} \alpha _m p_m({\mathbf{x}} |\theta _m), \end{aligned}$$
(1)

where \(\alpha _m\) is the \(m\)-th mixing proportion and \(p_m\) is the probability density function of the \(m\)-th component. In (1), \(\theta _m\) is the set of parameters defining the \(m\)-th component and \(\Theta =\{\theta _1,\theta _2,\ldots ,\theta _K,\alpha _1,\alpha _2,\ldots ,\alpha _K\}\) is the complete set of parameters needed to define the mixture. \(\Theta\) is unknown and has to be estimated in the mixture learning process. The mixing proportions must satisfy the following conditions:

$$\begin{aligned} \alpha _m >0,\;\; m=1,\ldots ,K \quad \text {and} \quad \sum _{m=1}^K \alpha _m=1. \end{aligned}$$
(2)

The number of components \(K\) is either known a priori or has to be determined during the mixture learning process. In the paper it is assumed that \(K\) is known.

The present paper considers the most widely used class of mixture models known as the Gaussian mixture models (GMMs), in which the probability density function of the \(m\)-th component is given by:

$$\begin{aligned} p_m({\mathbf{x}} |\theta _m)=\frac{1}{{(2\pi )}^{d/2}|\Sigma _m|^{1/2}}\exp \left(-\frac{1}{2}({\mathbf{x}}-{\mathbf{\mu }}_m)^T\Sigma _m^{-1}({\mathbf{x}}-{\mathbf{\mu }}_m)\right), \end{aligned}$$
(3)

where \({\mathbf{\mu }}_m\) and \(\Sigma _m\) denote, respectively, the mean vector and covariance matrix, \(|\cdot |\) denotes a determinant of a matrix and \(d\) is the dimension of the feature space. The set of parameters of the \(m\)-th component is \(\theta _m=\{{\mathbf{\mu }}_m,\Sigma _m\}\). Thus, for a GMM \(\Theta\) is defined by: \(\Theta =\{{\mathbf{\mu }}_1,\Sigma _1,\ldots ,{\mathbf{\mu }}_K,\Sigma _K,\alpha _1,\ldots ,\alpha _K\}.\) GMMs are widely used in such applications as data clustering [12, 27], data classification [15], speaker recognition [33], image segmentation and classification [29].
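To make the notation concrete, the following minimal NumPy sketch evaluates the component density (3) and the mixture density (1). The function names are illustrative only and are not taken from the paper's implementation (which, as noted in Sect. 5.2, is written in C++).

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate normal density of one mixture component, Eq. (3)."""
    d = x.shape[0]
    diff = x - mu
    # Solve a linear system instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    norm_const = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm_const

def gmm_density(x, weights, means, covs):
    """Mixture density p(x | Theta), Eq. (1)."""
    return sum(a * gaussian_density(x, mu, cov)
               for a, mu, cov in zip(weights, means, covs))
```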

Estimation of the parameters of a GMM can be performed using the maximum likelihood approach. Given a set of independent and identically distributed feature vectors \(X=\{{\mathbf{x}}^1,{\mathbf{x}}^2,\ldots , {\mathbf{x}}^N\}\), called the learning set, the log likelihood corresponding to a \(K\)-component GMM is given by:

$$\begin{aligned} \log p(X|\Theta )=\log \prod _{i=1}^N p({\mathbf{x}}^i|\Theta )=\sum _{i=1}^N \log \sum _{m=1}^K \alpha _m p_m({\mathbf{x}}^i | \theta _m). \end{aligned}$$
(4)

The maximum likelihood estimate of the parameters is given by:

$$\begin{aligned} \Theta _{ML}=\mathop {\arg \max }\limits _\Theta \{\log p(X|\Theta )\}. \end{aligned}$$
(5)

Since the solution of this maximization problem cannot be obtained in a closed form (cf. [5]), a numerical optimization method has to be employed.

Because of its simplicity, the EM (expectation-maximization) algorithm [9, 24] is the most popular tool for maximum likelihood estimation of the parameters of GMMs. This procedure, however, is highly sensitive to initialization and easily gets trapped in a local optimum of (4). Therefore, the quality of the final solution is strongly dependent on the initial guess of the mixture parameters. The problem can be alleviated to some degree by performing multiple runs of the algorithm, each starting from different initial conditions, and returning the result of the run with the highest \(\log p(X|\Theta )\). This approach is called multiple restart EM (MREM).

Model-based clustering [12, 27] is one of the most important applications of GMMs. The aim of clustering can be defined as dividing a set of objects into \(K\) disjoint groups, called clusters, in such a way that the objects within the same group are very similar, whereas the objects in different groups are very distinct. In applications of GMMs to clustering, it is assumed that each feature vector was generated from one of the \(K\) mixture components. The goal of clustering is to identify, for each feature vector, the mixture component from which it was generated. If the mixture parameters can be estimated, clustering can be performed using the maximum a posteriori (MAP) rule, by allocating a feature vector to the cluster (mixture component) with the highest posterior probability. From the Bayes theorem, this probability for the mixture component \(m\) can be expressed as:

$$\begin{aligned} h_m({\mathbf{x}}^i)=\frac{\alpha _m p_m({\mathbf{x}}^i|\theta _m)}{p({\mathbf{x}}^i|\Theta )}. \end{aligned}$$
(6)

Maximization of (6) is equivalent to finding the mixture index \(m\) with the highest value \(\alpha _m p_m({\mathbf{x}}^i|\theta _m)\).
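As an illustration only, the MAP rule can be sketched as follows (SciPy's multivariate_normal is used here for brevity; it is not part of the paper's implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_cluster(X, weights, means, covs):
    """Assign every row of X to the component with the largest value of
    alpha_m * p_m(x | theta_m), which maximizes the posterior (6)."""
    scores = np.column_stack([
        w * multivariate_normal(mu, cov).pdf(X)
        for w, mu, cov in zip(weights, means, covs)
    ])
    return scores.argmax(axis=1)
```

In practice, log densities would typically be used instead of raw densities to avoid underflow in high-dimensional feature spaces.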

The aim of the paper is the description and thorough evaluation of a new method for the initialization of the EM algorithm for Gaussian model-based clustering. The method, while retaining the probabilistic nature of random initialization, tries to initialize the component means with feature vectors located far away from the already initialized components. The advantage of the method over a purely random approach is the increased probability of properly initializing all cluster (component) centers in the case of well-separated clusters, which can lead to better clustering results.

The remainder of the paper is organized as follows. In the next section, the research background related to the presented work is discussed. In Sect. 3, the EM algorithm for GMMs is described. Section 4 presents the author’s approach to the initialization problem. Section 5 presents the results of computational experiments, in which the approach was compared to three other initialization methods for solving the clustering problem, using both synthetic and real data. The final section offers the concluding remarks.

2 Related research

The EM algorithm is a local optimization method. Because the log likelihood function (4) has numerous local maxima [25], the performance of the EM algorithm is strongly dependent on the initial conditions. Many initialization methods have been proposed, but no single strategy outperforms the rival procedures in all cases. In this section, only the most popular approaches are mentioned.

The standard procedure for tackling the problem of the EM algorithm initialization is the multiple restart approach (MREM). In this approach the EM algorithm is run many times, each run starting from different random initial conditions. The result of the best run (in the sense of the highest final \(\log p(X|\Theta )\)) is returned as the final outcome. The most popular random method [25], which in this paper is called rnd-nearest, initializes the component means with randomly chosen feature vectors. After initialization of the means, the feature vectors are clustered by assigning each feature vector to the cluster represented by the closest mean. Next, the sample covariance matrix and the mixing proportion of each cluster are used as the initial estimates of the corresponding component parameters. A variant of this method, known as rnd-kmeans, uses the \(K\)-means algorithm [26], initialized with random feature vectors, to find the component means. Next, the covariance matrices and mixing proportions are initialized in the same way as in the rnd-nearest technique. However, the \(K\)-means algorithm is itself susceptible to local optima.

Another random initialization method [10, 30] uses mixing proportions equal to \(1/K\), selects random feature vectors as component means, and employs a spherical covariance matrix equal to \(0.1\sigma ^2{\mathbf{I}}\), where \({\mathbf{I}}\) is the identity matrix and

$$\begin{aligned} \sigma ^2=\frac{1}{d}\mathrm {trace}(\Sigma ). \end{aligned}$$
(7)

In the above equation, \(\Sigma\) is the covariance matrix of the learning set \(X\). This approach is referred to as rnd-spherical.
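A minimal sketch of the rnd-spherical initializer, assuming NumPy (the function name is illustrative and not taken from the cited works):

```python
import numpy as np

def rnd_spherical_init(X, K, rng=None):
    """Equal mixing proportions, random feature vectors as means, and
    spherical covariances 0.1 * sigma^2 * I with sigma^2 as in Eq. (7)."""
    rng = rng if rng is not None else np.random.default_rng()
    N, d = X.shape
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]
    sigma2 = np.trace(np.cov(X, rowvar=False)) / d
    covs = np.array([0.1 * sigma2 * np.eye(d) for _ in range(K)])
    return weights, means, covs
```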

In [4], an extension of MREM called emEM was proposed. The idea of emEM involves performing several short EM runs using different random starting points and a lax convergence criterion. The mixture parameters obtained by the best (in the sense of the highest \(\log p(X|\Theta )\)) short run are used as a starting point for a long EM run. This strategy can be improved by repeating it many times until the available CPU time is exhausted. A variant of emEM called rndEM [21] reduces the short EM phase to the evaluation of \(\log p(X|\Theta )\) of the random starting position.

In yet another approach to the GMM parameter estimation problem, the EM algorithm is combined with a global optimization method, e.g., an evolutionary algorithm [1, 30] or a particle swarm optimizer [7]. However, global optimizers have high computational demands, and this approach is limited to moderately sized datasets. Other algorithms proposed to deal with the problem of local maxima include versions of EM with split and merge operations [34, 36], a greedy learning method [35], and a component-wise method [10]. Some of these approaches try to simultaneously estimate the number of components \(K\) while searching for the optimal set of mixture parameters \(\Theta\).

The idea of using distant feature vectors to initialize center-based clustering algorithms is not new. In [17] it was described in the context of Lloyd’s method for vector quantization, equivalent to the \(K\)-means algorithm. The original approach initialized the centroid of the first cluster with the feature vector with the maximum norm and considered all feature vectors (instead of a random subset, as in the method presented in this article) as candidate new cluster centers. For this reason, it had a deterministic nature and was unsuitable for a multiple restart scenario. In [18] a probabilistic version of this method for the \(K\)-means algorithm was proposed. Since the \(K\)-means algorithm can be interpreted in the framework of model-based clustering (see e.g., [21] or [23]), the method proposed in this paper can be considered a generalization of [18] to the case of non-spherical and non-homogeneous covariances. The well-known \(K\)-means++ algorithm [2] initializes cluster centroids with random feature vectors chosen with probability proportional to the squared shortest distance to the already initialized centroids.
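For comparison with the method proposed later in this paper, the \(K\)-means++ selection rule can be sketched as follows (illustrative code using Euclidean distances and no covariance information; it is not taken from [2]):

```python
import numpy as np

def kmeanspp_centers(X, K, rng=None):
    """D^2 sampling: each new centroid is drawn with probability proportional
    to the squared Euclidean distance to the nearest centroid chosen so far."""
    rng = rng if rng is not None else np.random.default_rng()
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```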

The present article is an extension of the conference paper [19], in which initial experimental results achieved by the method were reported.

3 EM algorithm for Gaussian mixture models

The EM algorithm [9, 24] is an iterative method which, starting from an initial guess of the mixture parameters \(\Theta ^{(0)}\), generates a sequence of estimates \(\Theta ^{(1)},\Theta ^{(2)},\ldots ,\Theta ^{(j)},\ldots\) with increasing log likelihood (i.e., \(\log p(X|\Theta ^{(j)})>\log p(X|\Theta ^{(j-1)})\)). Each iteration \(j\) of the algorithm consists of two steps called the expectation step (E-step) and the maximization step (M-step). For GMMs, these steps are defined as follows [27, 32]:

  • E-Step: Given the set of mixture parameters \(\Theta ^{(j-1)}\) from the previous iteration, for each \(m=1,\ldots ,K\) and \(i=1,\ldots ,N\), the posterior probability that a feature vector \({\mathbf{x}}^i\) was generated from the \(m\)-th component is calculated as:

    $$\begin{aligned} h_m^{(j)}({\mathbf{x}}^i)=\frac{\alpha _m^{(j-1)} p_m({\mathbf{x}}^i|\theta _m^{(j-1)})}{\sum _{k=1}^K \alpha _k^{(j-1)} p_k({\mathbf{x}}^i|\theta _k^{(j-1)})}, \end{aligned}$$
    (8)

    where \(\theta _m^{(j-1)}\) and \(\theta _k^{(j-1)}\) denote parameters of components \(m\) and \(k\), in the iteration \(j-1\), respectively.

  • M-Step: Given the posterior probabilities \(h_m^{(j)}({\mathbf{x}}^i)\), the set of parameters \(\Theta ^{(j)}\) is calculated as:

    $$\begin{aligned} \alpha _m^{(j)}=\frac{1}{N}\sum _{i=1}^{N}h_m^{(j)}({\mathbf{x}}^i) \end{aligned}$$
    (9)
    $$\begin{aligned} {\mathbf{\mu }}_m^{(j)}=\frac{\sum _{i=1}^{N}h_m^{(j)}({\mathbf{x}}^i)\,{\mathbf{x}}^i}{\sum _{i=1}^{N}h_m^{(j)}({\mathbf{x}}^i)} \end{aligned}$$
    (10)
    $$\begin{aligned} \Sigma _m^{(j)}=\frac{\sum _{i=1}^{N}h_m^{(j)}({\mathbf{x}}^i)({\mathbf{x}}^i-{\mathbf{\mu }}_m^{(j)})({\mathbf{x}}^i-{\mathbf{\mu }}_m^{(j)})^T}{\sum _{i=1}^{N}h_m^{(j)}({\mathbf{x}}^i)}. \end{aligned}$$
    (11)

The E-steps and M-steps are applied alternately until a convergence criterion is met. In the experiments presented in this paper, a convergence criterion based on a relative improvement of log likelihood was employed. The EM algorithm was terminated if

$$\begin{aligned} \frac{\log p(X|\Theta ^{(j)}) - \log p(X|\Theta ^{(j-1)})}{\log p(X|\Theta ^{(j-1)})}<\epsilon , \end{aligned}$$
(12)

where \(\epsilon \ll 1\) was a user-defined termination threshold (in the present case \(\epsilon =10^{-5}\)).
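The two steps and the stopping rule (12) can be summarized in the following NumPy sketch. It is illustrative only; the experiments in Sect. 5 use the paper's parallel C++ implementation (Sect. 5.2). Absolute values are used in the stopping test because the log likelihood is typically negative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, weights, means, covs, eps=1e-5, max_iter=1000):
    """EM iterations for a GMM, Eqs. (8)-(11), with stopping rule (12).
    weights, means and covs hold the initial guess Theta^(0)."""
    N, d = X.shape
    K = len(weights)
    prev_ll = None
    for _ in range(max_iter):
        # E-step: weighted component densities and responsibilities, Eq. (8).
        dens = np.column_stack([
            weights[m] * multivariate_normal(means[m], covs[m]).pdf(X)
            for m in range(K)
        ])                                          # shape (N, K)
        ll = np.log(dens.sum(axis=1)).sum()         # log likelihood, Eq. (4)
        h = dens / dens.sum(axis=1, keepdims=True)
        # M-step: mixing proportions, means and covariances, Eqs. (9)-(11).
        nk = h.sum(axis=0)
        weights = nk / N
        means = (h.T @ X) / nk[:, None]
        covs = np.stack([
            (h[:, m, None] * (X - means[m])).T @ (X - means[m]) / nk[m]
            for m in range(K)
        ])
        # Stopping rule based on relative log likelihood improvement, Eq. (12).
        if prev_ll is not None and abs(ll - prev_ll) / abs(prev_ll) < eps:
            break
        prev_ll = ll
    return weights, means, covs, ll
```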

4 Description of the proposed rnd-maxmin method

In the discussed method, equal initial mixing proportions (\(\alpha _m^{(0)}=1/K\)) are used. In the next two subsections, we describe how we generate the initial component means and covariances.

4.1 Component means

The rationale behind the presented approach to the initialization of the EM algorithm for GMMs was to correct the shortcoming of random initialization methods illustrated in Fig. 1. The figure presents a stage of the initialization algorithm for a three-component mixture at which two component means (labeled as 1st component and 2nd component) have been correctly initialized by feature vectors marked by \(\square\). To increase the chances of discovering the optimal solution, it would be beneficial to initialize the mean of the third component with one of the feature vectors labeled \(\vartriangle\). However, in the standard random initialization method, all the feature vectors, including the vectors labeled \(\square\), have the same probability of being selected as the mean of the third component. In that situation, the selection of one of the vectors marked by \(\square\) may lead to a suboptimal solution.

Fig. 1
figure 1

Example of a three-component mixture illustrating the motivation for the rnd-maxmin method

To amend this shortcoming, the method devised by the authors, denoted as rnd-maxmin, initializes the component means and covariances incrementally. When selecting new feature vectors as component means, it tries to disregard the vectors located close to the already initialized components. The method has a probabilistic nature, which allows it to be used in the MREM algorithm. The initialization procedure can be described by the following steps:

  1. Choose the mean of the first component \({\mathbf{\mu }}_1^{(0)}\) as a random feature vector. Generate randomly the covariance matrix \(\Sigma _1^{(0)}\). Set \(m=2\).

  2. Choose \(t\) (\(t\) is a parameter of the method) random unique feature vectors from the remaining (not yet selected as component means) elements of \(X\). Denote by \(X_r=\{{\mathbf{x}}_{r_1},{\mathbf{x}}_{r_2},\ldots ,{\mathbf{x}}_{r_t}\}\) the set of the chosen vectors. For each \({\mathbf{x}}_{r_i} \in X_r,\) compute the minimal squared Mahalanobis distance to the already chosen component means \({\mathbf{\mu }}_1^{(0)},\ldots ,{\mathbf{\mu }}_{m-1}^{(0)}\):

    $$\begin{aligned} \mathrm {dmin}^2({\mathbf{x}}_{r_i})=\min _{j=1,\ldots ,m-1} {\mathrm {d}}_M^2\left({\mathbf{\mu }}_{j}^{(0)},\Sigma _j^{(0)},{\mathbf{x}}_{r_i}\right), \end{aligned}$$
    (13)

    where \(\mathrm {d}_M^2({\mathbf{\mu }}_{j}^{(0)},\Sigma _j^{(0)},{\mathbf{x}}_{r_i})\) is given by:

    $$\begin{aligned} \mathrm {d}_M^2({\mathbf{\mu }}_{j}^{(0)},\Sigma _j^{(0)},{\mathbf{x}}_{r_i})=({\mathbf{\mu }}_{j}^{(0)}-{\mathbf{x}}_{r_i})\left( \Sigma _j^{(0)} \right) ^{-1}({\mathbf{\mu }}_{j}^{(0)}-{\mathbf{x}}_{r_i})^T. \end{aligned}$$
    (14)
  3. Select as the \(m\)-th component mean the element of \(X_r\) with the largest minimal squared distance:

    $$\begin{aligned} {\mathbf{\mu }}_m^{(0)}=\mathop {\arg \max }\limits _{{\mathbf{x}}_{r_i} \in X_r}\mathrm {dmin}^2({\mathbf{x}}_{r_i}). \end{aligned}$$
    (15)
  4. Generate randomly the covariance matrix \(\Sigma _m^{(0)}\).

  5. If \(m<K\), set \(m=m+1\) and go to Step 2. Otherwise, terminate the algorithm.

The preliminary experiments indicated that the choice of the parameter \(t\) influences the performance of the above initialization method. In the computational experiments reported in Sect. 5, we chose \(t=5\) for \(K>5\) and \(t=K\) for \(K \le 5\), which yielded good results.
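The mean-selection part of rnd-maxmin can be sketched as follows (illustrative NumPy code; random_covariance stands for the covariance generator described in Sect. 4.2):

```python
import numpy as np

def rnd_maxmin_means(X, K, t, random_covariance, rng=None):
    """Steps 1-5 of Sect. 4.1: pick component means incrementally, preferring
    candidates far (in squared Mahalanobis distance) from the means chosen
    so far. random_covariance(X, K, rng) returns a random covariance matrix."""
    rng = rng if rng is not None else np.random.default_rng()
    N = X.shape[0]
    chosen = [int(rng.integers(N))]                 # index of the first mean
    means = [X[chosen[0]]]
    covs = [random_covariance(X, K, rng)]
    inv_covs = [np.linalg.inv(covs[0])]
    for _ in range(1, K):
        # Step 2: t random candidates not yet used as component means.
        pool = np.setdiff1d(np.arange(N), chosen)
        cand = rng.choice(pool, size=t, replace=False)
        dmin = np.array([
            min((X[i] - mu) @ ic @ (X[i] - mu)      # Eqs. (13)-(14)
                for mu, ic in zip(means, inv_covs))
            for i in cand
        ])
        best = int(cand[np.argmax(dmin)])           # Step 3, Eq. (15)
        chosen.append(best)
        means.append(X[best])
        # Step 4: random covariance for the newly initialized component.
        covs.append(random_covariance(X, K, rng))
        inv_covs.append(np.linalg.inv(covs[-1]))
    return np.array(means), np.array(covs)
```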

4.2 Component covariances

Random generation of the covariance matrix \(\Sigma _m^{(0)}\) was based on the eigenvalue decomposition. Since a covariance matrix of a non-degenerate multivariate normal distribution is positive definite, it can be expressed as:

$$\begin{aligned} \Sigma _m^{(0)}=Q_m\Lambda _m Q_m^T, \end{aligned}$$
(16)

where \(\Lambda _m\) is a diagonal matrix of eigenvalues with all diagonal elements positive, and \(Q_m\) is an orthogonal matrix of eigenvectors. In the presented method, the eigenvalues are generated first. They are generated randomly with the following two restrictions:

  • The ratio of the largest eigenvalue to the smallest eigenvalue cannot be greater than 10.

  • The sum of all eigenvalues should be equal to \(\frac{1}{10 d K}\mathrm {tr}(\Sigma )\), where \(\mathrm {tr}\) is the trace of a matrix and \(\Sigma\) is the covariance matrix of the learning set \(X\).

To fulfill these two conditions, we first randomly generate \(d\) positive numbers \(\lambda _1,\lambda _2,\ldots ,\lambda _d\) from the uniform distribution. Let \(\lambda _{\text {max}}\) be the largest generated number. For each of the generated numbers \(\lambda _i,\) we check whether \(\lambda _i<0.1\lambda _{\text {max}}\); if this condition holds, we set \(\lambda _i=0.1\lambda _{\text {max}}\). Finally, we scale all the numbers by the same factor so that their sum fulfills the second condition.

In the next step, the orthogonal matrix \(Q_m\) is obtained by randomly generating a square \(d \times d\) matrix (each element is generated independently from the standard normal distribution) and performing its QR decomposition [13]. The covariance matrix \(\Sigma _m^{(0)}\) is then obtained using (16).
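A sketch of this covariance generator, assuming NumPy (the function name matches the hypothetical helper used in the sketch of Sect. 4.1):

```python
import numpy as np

def random_covariance(X, K, rng=None):
    """Random covariance via the eigenvalue decomposition (16): the ratio of
    the largest to the smallest eigenvalue is capped at 10, the eigenvalue
    sum is scaled to trace(Sigma) / (10 * d * K), and Q comes from a QR
    decomposition of a random Gaussian matrix."""
    rng = rng if rng is not None else np.random.default_rng()
    d = X.shape[1]
    lam = rng.uniform(size=d)
    lam = np.maximum(lam, 0.1 * lam.max())              # first restriction
    lam *= np.trace(np.cov(X, rowvar=False)) / (10.0 * d * K) / lam.sum()
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random orthogonal Q
    return q @ np.diag(lam) @ q.T                       # Eq. (16)
```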

4.3 Influence of the proposed method on computational complexity of the MREM algorithm

The computational complexity of a single EM iteration is \(O(NKd^2+Kd^3)\). Because usually \(K \ll N\) and \(d \ll N\), this complexity can be approximated by \(O(NKd^2)\). The computation of \(\mathrm {tr}(\Sigma )\) requires \(O(Nd)\) time and is performed only once per MREM run. Initialization of a single component by our method requires random generation of covariance matrix and computation of its inverse, which can be accomplished in \(O(d^3)\) time. Initialization of a component also involves random generation of the mean which can be accomplished in \(O(td^2)\) time. In all experiments, we used \(t \le 5\). For that reason, we can simplify this expression to \(O(d^2)\). Since there are \(K\) components in a mixture, the total time of initialization is \(O(K(d^2+d^3))=O(Kd^3)\). Because there are hundreds of EM iterations in a single MREM run, and for typical datasets \(K \ll N\) and \(d \ll N\), the overhead of initialization is usually very low.

We have performed some profiling experiments with our software, which confirmed the above analysis: the ratio of the runtime of our initialization method to the total runtime of the MREM was less than 1.5 %.

5 Experimental results

In this section, the results of the computational experiments performed on many synthetic datasets and four real-life datasets are presented. The experiments used GMMs for model-based clustering, in which, after learning the model parameters with the MREM algorithm, the feature vectors were clustered using the MAP rule, as explained in Sect. 1. An external cluster validity measure was employed to compare the partitions of data discovered in the course of clustering with the original partitions. In the case of synthetic data, the original partitions were known, because all the datasets were generated by a random number generator, making it possible to track the source (i.e., mixture component) of each feature vector. The real-life datasets also come with class labels, which makes it possible to compare their original partitions with the obtained clustering results.

The external cluster validity measure used to compare the results of clustering methods was the adjusted Rand index (ARI) [16]. The expected value of ARI in the case of randomly generated partitions is 0. A higher value of ARI indicates a higher similarity between partitions; the maximum value of 1 means that two partitions are identical.
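For readers wishing to reproduce the evaluation, ARI is available in standard libraries, for example in scikit-learn (used here purely for illustration; it is not necessarily the implementation used in the paper):

```python
from sklearn.metrics import adjusted_rand_score

def ari(labels_true, labels_pred):
    """Adjusted Rand index [16] between the reference partition and the
    partition produced by MAP clustering."""
    return adjusted_rand_score(labels_true, labels_pred)
```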

The main aim of computational experiments with synthetic data was to identify conditions in which MREM using our initialization method produces better clustering results than MREM using other common methods. We also wanted to demonstrate its performance on several real-life datasets.

5.1 The algorithms

The rnd-maxmin method described in Sect. 4 was compared with the rnd-nearest, rnd-spherical, and rnd-kmeans methods outlined in Sect. 2. It is possible that, after the initialization of the component means, the rnd-nearest or rnd-kmeans method obtains a singular component covariance matrix. This happens, for instance, when the number of objects in the initial cluster is smaller than the number of dimensions \(d\). In this case, the implementations of rnd-nearest and rnd-kmeans used here resort to rnd-spherical initialization of the covariance matrix of the affected component.

The four random methods were always compared on an equal runtime basis. Each of the methods was allocated the same amount of CPU time, and the MREM algorithm was run until the allocated time was exhausted. The best (in the sense of the highest log likelihood) solution found by MREM was taken as the outcome of the method.

It should be noted that the log likelihood for GMMs is unbounded [5]. As a consequence, the EM algorithm may converge to the boundary of the parameter space, with a covariance matrix of a component becoming arbitrarily close to singular and \(\log p(X|\Theta )\) approaching infinity. In the present experiments with the multiple restart version of the EM algorithm, this issue was addressed by computing, after each EM iteration, condition numbers of component covariances [10]. If the largest condition number was above a fixed large threshold, the EM run was aborted and a new one was commenced.
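A minimal sketch of this safeguard is given below; the threshold value is an illustrative assumption, since the paper only states that a fixed large threshold was used.

```python
import numpy as np

def run_is_degenerate(covs, max_cond=1e12):
    """Return True when the largest condition number among the component
    covariance matrices exceeds the threshold, signalling that the current
    EM run should be aborted and a new one started."""
    return max(np.linalg.cond(c) for c in covs) > max_cond
```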

5.2 Hardware and software

The EM algorithm and all four random initialization methods were implemented in C++ language and compiled by the Intel C++ compiler version 14.0.1 using optimizing options (-O3 -ipo -march=core2 -fno-alias). The algorithms were run on a Dell Poweredge 1950 server with two quad-core Intel Xeon 5355 (2.66 GHz) processors and 16 GB of RAM, running Ubuntu Linux 12.04. The implementation of EM was parallelized [20] using OpenMP standard for shared memory computers, taking advantage of all eight cores of the system. Moreover, a cluster of 16 identical Dell servers was put to use, making it possible to perform up to 16 independent experiments in parallel. While the approach with multiple servers did not speed up a single EM run, it was able to speed up batches of EM runs which had to be performed.

To verify that the differences between the EM initialization methods were not a random effect, nonparametric statistical tests were performed using the R software package [31].

5.3 Synthetic datasets

In the experiments with synthetic data, a generator recently proposed in [22] was employed which randomly generates Gaussian mixtures according to the user-defined overlap characteristics. The overlap \(\omega _{ij}\) between two clusters \(i\) and \(j\) is defined as the sum of two misclassification probabilities \(\omega _{j|i}\) and \(\omega _{i|j}\) where:

$$\begin{aligned} \omega _{j|i}=Pr[\alpha _i p({\mathbf{x}}|{\mathbf{\mu }}_i,\Sigma _i) < \alpha _j p({\mathbf{x}}|{\mathbf{\mu }}_j,\Sigma _j) | {\mathbf{x}}\sim \mathcal{N}({\mathbf{\mu }}_i,\Sigma _i)], \end{aligned}$$
(17)

and similarly:

$$\begin{aligned} \omega _{i|j}=Pr[\alpha _j p({\mathbf{x}}|{\mathbf{\mu }}_j,\Sigma _j) < \alpha _i p({\mathbf{x}}|{\mathbf{\mu }}_i,\Sigma _i) | {\mathbf{x}}\sim \mathcal{N}({\mathbf{\mu }}_j,\Sigma _j)]. \end{aligned}$$
(18)
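MixSim computes these probabilities analytically [22, 28]; the following Monte Carlo sketch is only meant to make the definition (17) concrete (SciPy is used here for convenience):

```python
import numpy as np
from scipy.stats import multivariate_normal

def overlap_j_given_i(alpha_i, mu_i, cov_i, alpha_j, mu_j, cov_j,
                      n=100_000, seed=0):
    """Monte Carlo estimate of the misclassification probability omega_{j|i}
    in Eq. (17): draw x ~ N(mu_i, Sigma_i) and count how often component j
    attains the higher weighted density."""
    rng = np.random.default_rng(seed)
    x = multivariate_normal(mu_i, cov_i).rvs(size=n, random_state=rng)
    p_i = alpha_i * multivariate_normal(mu_i, cov_i).pdf(x)
    p_j = alpha_j * multivariate_normal(mu_j, cov_j).pdf(x)
    return float(np.mean(p_i < p_j))
```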

The overlap characteristics of the mixtures obtained from the generator [22] were controlled by two parameters: \(\bar{\omega }\), specifying the average pairwise overlap between components, and \(\check{\omega }\), specifying the maximum pairwise overlap. In the experiments, the number of components \(K\) was fixed at 20 and mixtures with dimension \(d\in \{2,5,10\}\) were generated. For each dimension, following [22], we used \(\bar{\omega } \in \{0.001,0.01,0.05\}\). For each value of \(\bar{\omega }\) we used three values of \(\check{\omega }\): a value specifying very high maximum overlap, a value specifying very low maximum overlap, and an intermediate value specifying moderate maximum overlap. Thus, for each dimension \(d,\) we used nine \((\bar{\omega },\check{\omega })\) pairs. The parameter \(\pi _\wedge\) determining the minimal mixing proportion was set to \(\pi _\wedge =0.25 \cdot 1/K=0.0125\). Figure 2 shows six example datasets simulated from the mixtures obtained from the generator [22] for the three values of \(\bar{\omega }\) and extreme values of \(\check{\omega }\).

Fig. 2
figure 2

Two-dimensional datasets simulated from 20-component mixtures with (a) \(\bar{\omega }=0.001\) and \(\check{\omega }=0.05\), (b) \(\bar{\omega }=0.001\) and \(\check{\omega }=0.15\), (c) \(\bar{\omega }=0.01\) and \(\check{\omega }=0.25\), (d) \(\bar{\omega }=0.01\) and \(\check{\omega }=0.60\), (e) \(\bar{\omega }=0.05\) and \(\check{\omega }=0.45\), (f) \(\bar{\omega }=0.05\) and \(\check{\omega }=0.75\). The ellipses are centered around component means and represent 95 % confidence regions

As noticed in [22], both parameters influence the overlap characteristics of the generated mixtures. For instance, the mixtures illustrated in Fig. 2a, b both have very low average overlap (\(\bar{\omega }=0.001\)) between component pairs. However, in Fig. 2b, with the high \(\check{\omega }=0.15,\) the components are much better separated, except for two component pairs which contribute most of the average overlap. Similar trends are repeated in Fig. 2c, d. The mixture components in Fig. 2e, f are poorly separated in both cases.

The experimental setup was as follows. For each triplet \((d,\bar{\omega },\check{\omega })\), 50 different random mixtures were generated using MixSim software [28]. The generated mixture parameters were stored to be used as ground truth for comparison. Then, for each mixture a single dataset consisting of 6000 feature vectors was realized, and for each of the datasets a single run of MREM using a given initialization method was performed. The runtime allocated for MREM was 100 s for \(d=10\), 50 s for \(d=5\), and 20 s for \(d=2\).

The feature vectors were clustered by the MAP rule using the best (in the sense of highest \(\log p(X|\Theta )\)) set of parameters found by MREM. Next, the partition found by cluster analysis was compared (using ARI) with the original partition of the data obtained by tracking a source (mixture) component which generated each feature vector.

The average ARI values obtained for the clustering based on the ground truth mixture parameters were used as the baseline for the comparison of the four EM initialization methods. For each method, the result is presented as the % error relative to the ground truth mixture parameters. The % error \(\epsilon _A\) of method \(A\) is computed as \(\epsilon _A=100\,(\mathrm {ARI}_T-\mathrm {ARI}_A)/\mathrm {ARI}_T\), where \(\mathrm {ARI}_T\) is the average (over 50 different mixtures) ARI obtained using the ground truth parameters and \(\mathrm {ARI}_A\) is the average ARI obtained using the mixture parameters estimated by MREM with initialization method \(A\). A lower value of \(\epsilon _A\) indicates better performance; values close to 0 indicate that MREM with initialization method \(A\) achieves a performance similar to clustering with the ground truth mixture parameters.

To assess the statistical significance of the differences in ARI and log likelihood, Wilcoxon signed rank test for paired data [8] was conducted in which, separately for each triplet \((d,\bar{\omega },\check{\omega })\), the results obtained by the rnd-maxmin method were compared with the results obtained by the best of the three remaining methods.
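For readers wishing to reproduce this analysis, the paired test is also available, e.g., in SciPy (the paper itself used R [31]); the arrays below are placeholders for the 50 per-mixture results of two methods:

```python
import numpy as np
from scipy.stats import wilcoxon

def paired_p_value(results_maxmin, results_best_other):
    """Wilcoxon signed rank test [8] on paired per-mixture results
    (e.g., ARI or log likelihood values for the same 50 mixtures)."""
    _, p_value = wilcoxon(results_maxmin, results_best_other)
    return p_value

# Illustrative call with dummy data standing in for the real 50-element arrays:
p = paired_p_value(np.random.rand(50), np.random.rand(50))
```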

The results of the experiments with synthetic data for dimension \(d=2\) and different values of \(\bar{\omega }\) and \(\check{\omega }\) are shown in Table 1. \(\epsilon _A\) represents the % error in ARI relative to the ground truth mixture parameters. The last row of the table, labeled ’optimal ARI’, shows the ARI obtained when clustering was performed using the ground truth mixture parameters. \(L\) represents the average (over 50 different mixtures) log likelihood obtained using a given initialization method. In each column, the best results for \(\epsilon _A\) and \(L\) are shown in italics. In the row presenting the results of the rnd-maxmin method, \(p_\epsilon\) and \(p_L\) represent the Wilcoxon signed rank test p values in the statistical comparison of the rnd-maxmin method with the best of the three remaining methods with respect to \(\epsilon _A\) and \(L\), respectively. If the comparison is statistically significant at the \(0.05\) level, the corresponding result of rnd-maxmin is shown in bold.

The results obtained for \(d=5\) and \(d=10\) are shown in Tables 2 and 3, respectively.

Table 1 Performance of model-based clustering using four EM initialization methods for \(d=2\). The best results for \(\epsilon _A\) and \(L\) are shown in italics. Additionally, the results for the rnd-maxmin method are shown in bold if the comparison with the best of the remaining methods is significant at the 0.05 level
Table 2 Performance of model-based clustering using four EM initialization methods for \(d=5\)
Table 3 Performance of model-based clustering using four EM initialization methods for \(d=10\)

The results from Tables 1–3 indicate that the average overlap \(\bar{\omega }\) alone is mostly sufficient for determining the attainable ARI when clustering is performed using the ground truth parameters. However, both \(\bar{\omega }\) and \(\check{\omega }\) influence the performance of the four initialization methods with respect to \(\epsilon _A\).

The comparison of the performance of the rnd-maxmin method with the best of the remaining methods indicates that:

  • rnd-maxmin dominates the others with respect to ARI and log likelihood when the average overlap between components is very low (\(\bar{\omega }=0.001\)) regardless of the value of \(d\) and \(\check{\omega }\).

  • For situations with low average overlap (\(\bar{\omega }=0.01\)), our method also dominates the others if the maximum overlap between components is not very high. However, the experiments identified a clear weakness of our approach: it usually yields results worse than the other methods (sometimes by a large margin) if the maximum overlap between components is very high. We conjecture that in these cases our strategy of locating component means far from the already initialized components fails to place two initial means in the close vicinity required by a pair of strongly overlapping components.

  • A similar trend is repeated for \(\bar{\omega }=0.05\).

  • The performance trends with respect to ARI and log likelihood are not identical (although they are very similar for \(\bar{\omega }=0.001\)).

We summarize the results with synthetic data in Table 4, which shows the number of significant (at \(0.05\) level) wins scored by each method against the best of the other methods, separately for each value of \(\bar{\omega }\). Table 4 demonstrates that if we remove rnd-maxmin from the competition, there is no clear winner among the three remaining methods.

5.4 Real datasets

In this section, we demonstrate the performance of the rnd-maxmin method on four real-life datasets with known class labels. Two of them, iris and thyroid, are small-scale datasets available from the UCI machine learning repository [3]. They are commonly used in comparisons of clustering algorithms. The two other datasets, Brodatz and pendigits, are examples of more challenging problems because of their greater sample size, feature space dimension, and number of classes.

In the experiments the class information was discarded during GMM learning; it was used only to compute ARI (Figs. 3a–6a and Tables 5–8). The MREM algorithm with each of the initialization methods was run for 200 s for the larger Brodatz and pendigits datasets and for 0.1 s for the smaller iris and thyroid datasets. This experiment was repeated 50 times, each time starting with a different seed of the random number generator. To assess the statistical significance of the differences in ARI and \(\log p(X|\Theta ),\) the Wilcoxon rank sum test [8] was performed.

Table 4 Summary of the experiments with synthetic datasets. For each value of \(\bar{\omega }\) and each method, we report the number of experiments (out of 9 conducted experiments) in which the method performed significantly better (at \(0.05\) level) than the best of the remaining methods. The numbers before “/” signs concern ARI; the numbers after these signs concern log likelihood

5.4.1 Pendigits dataset

The pendigits dataset is available from the UCI machine learning repository [3]. It describes handwritten images of ten digits. Each digit is represented by a 16-dimensional feature vector. In [30] a subset of this dataset describing the first five digits was used for GMM learning. To evaluate the learning algorithm in a more challenging setting, in the presented experiments the complete dataset (\(K=10\)) consisting of 10992 feature vectors was used. Similarly to [30], first the dimensionality was reduced using principal component analysis (PCA), retaining the first nine principal components which together captured more than 95 % of the total variance.
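This preprocessing step can be reproduced, for example, with scikit-learn (an illustrative choice, not necessarily the tool used by the author):

```python
from sklearn.decomposition import PCA

def pca_reduce(X, var_kept=0.95):
    """Keep the smallest number of leading principal components whose
    cumulative explained variance reaches var_kept (nine components for
    the pendigits data in the paper)."""
    pca = PCA(n_components=var_kept, svd_solver="full")
    return pca.fit_transform(X)
```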

Figure 3 illustrates the progress toward the final solution of the four random initialization methods. The curves in Fig. 3a were obtained by measuring the ARI of the best (in the sense of highest log likelihood) solution found so far by the MREM algorithm in time steps of 10 s. The curves in Fig. 3b were updated after each EM run (average time of a single EM run was about 0.2 s).

Fig. 3
figure 3

Progress toward final solution of the four random initialization methods for pendigits data: (a) adjusted Rand index, (b) log likelihood of the best solution. The curves represent the averaged results of 50 independent experiments

Table 5 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs. The first column of the table indicates the name of the method. The second column shows the average (over 50 experiments) log likelihood obtained by the method and the standard deviation after \(\pm\) sign. The third column shows the p value for a statistical comparison of log likelihood (obtained using the Wilcoxon rank sum test) of the method with rnd-maxmin. The fourth column presents average ARI and the standard deviation. The last column shows the p value for a statistical comparison of ARI (obtained using the Wilcoxon rank sum test) of the method with rnd-maxmin.

Table 5 Comparison of four random initialization methods for pendigits dataset. The results are based on 50 independent 200 s runs of MREM

The results in Fig. 3 and Table 5 indicate that the rnd-maxmin method outperforms the three other random initialization methods with respect to ARI. Although the relative difference in ARI in comparison with the second best rnd-spherical is very small (about 0.5 %), it is statistically significant at the \(0.05\) level. The difference in log likelihood is significant as well.

5.4.2 Brodatz dataset

The Brodatz dataset originated from the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF). It was used in the framework of the Esprit project ELENA. The original source of the data was the Brodatz texture album [6]. Each object in the dataset belongs to 1 of 11 classes representing textures from the album. The objects are described by 40 features extracted by estimating fourth-order modified moments in four orientations: 0, 45, 90 and 135 degrees [14]. Each of the 11 groups consists of 500 objects.

In this dataset there are linear dependencies among the 40 features: the determinant of the covariance matrix is very near zero and the first 37 principal components capture 100 % of the total variance. Thus, this dataset is badly conditioned and a GMM cannot be estimated by the EM algorithm using the original features. To remove the linear dependencies from the data, we used PCA to reduce the dimensionality to 37.

Figure 4 illustrates the progress toward the final solution of the four random initialization methods.

Fig. 4
figure 4

Progress toward the final solution of the four random initialization methods for Brodatz data: (a) adjusted Rand index, (b) log likelihood of the best solution. The curves represent the averaged results of 50 independent experiments

It is evident that MREM using rnd-kmeans is easily trapped in poor local maxima of the log likelihood function, as its performance is far worse than the performance of the other three methods. Table 6 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs.

Table 6 Comparison of four random initialization methods for Brodatz dataset. The results are based on 50 independent 200 s runs of MREM

As in the case of the pendigits dataset, the rnd-maxmin method outperforms the others with respect to the ARI of the clustering solution (a 1.9 % relative difference) and the log likelihood of the estimated GMMs. The differences are significant at the \(0.05\) level.

5.4.3 Thyroid dataset

The thyroid dataset is also available from the UCI repository. It consists of the thyroid disease measurements of 215 patients from the same hospital. Each of the patients belongs to one of three groups with a known diagnosis: a group of healthy patients (150 cases), patients with hyperthyroidism (35 cases), and patients with hypothyroidism (30 cases). Each patient is characterized by the results of five laboratory tests.

Figure 5 illustrates the progress toward the final solution of the four random initialization methods. The curves in Fig. 5a were obtained by measuring the ARI of the best (in the sense of highest log likelihood) solution found so far by the MREM algorithm in time steps of 0.005 s. The curves in Fig. 5b were updated after each EM run (the average time of a single EM run was about 0.002 s).

Fig. 5
figure 5

Progress toward final solution of the four random initialization methods for thyroid data: (a) adjusted Rand index, (b) log likelihood of the best solution. The curves represent the averaged results of 50 independent experiments

Table 7 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs. Since the same partition of data was found by every method in each of the 50 runs (i.e., there was no difference between the methods), we did not perform statistical significance tests.

Table 7 Comparison of four random initialization methods for thyroid dataset. The results are based on 50 independent 0.1 s runs of MREM

5.4.4 Iris dataset

The well-known iris dataset [11], available from the UCI repository, contains four measurements of 150 samples of three iris species. The feature vectors are divided evenly into three classes.

Figure 6 illustrates the progress toward the final solution of the four random initialization methods.

Fig. 6
figure 6

Progress toward the final solution of the four random initialization methods for iris data: (a) adjusted Rand index, (b) log likelihood of the best solution. The curves represent the average results of 50 independent experiments

Table 8 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs. The results indicate that, as in the case of the thyroid dataset, there was no discernible difference between the four methods.

Table 8 Comparison of four random initialization methods for the iris dataset. The results are based on 50 independent 0.1 s runs of MREM

6 Conclusions and future work

In this article, a new random method for initialization of the EM algorithm for Gaussian mixture model-based clustering was proposed. The method was compared with three other approaches. The tests were carried out using multiple restart EM algorithm (MREM) on many synthetic datasets with different overlap characteristics and four real-life datasets.

The results of the experiments confirm the well-known fact that no single method outperforms the others in all cases. Our approach also has its strengths and weaknesses. Generally, it performs better than (or comparably to) the other methods if the maximum overlap between clusters is not very high. Otherwise, it produces clustering results worse than those of the competing algorithms.

Finally, it should be noted that for small-scale clustering problems, as exemplified by the experiments with the iris and thyroid datasets, the performance of the model-based clustering solution estimated by the MREM algorithm may be similar (if not virtually identical) irrespective of the initialization method.

There are several possible directions of future work. To correct the deficiencies of our method, we plan to devise a hybrid solution combining rnd-maxmin with other random methods. For the MREM approach, such a hybrid could easily be obtained by alternately running the EM algorithm with two or more different initialization methods (and keeping the mixture with the highest log likelihood). We plan to apply this hybrid method to classification problems and use it to model class-conditional densities in discriminant analysis [15]. This application of GMMs is often characterized by high overlap between mixture components, a situation in which the rnd-maxmin strategy alone performs poorly.

An anonymous reviewer has pointed out that there are possibilities other than the Mahalanobis distance for measuring the distance between a feature vector and a component mean (for instance, the contribution of the feature vector to the log likelihood). We plan to implement different distance measures in the rnd-maxmin method and test their performance.

There are also plans to combine the described initialization method for the EM algorithm with a global optimization method, e.g., differential evolution, similarly to what was done with the \(K\)-means algorithm in [18].