1 Introduction

In statistical analyses, a typical assumption is that the data used is measured without error. However, this assumption is often violated. One possible way to approach this problem is to assume measurement errors in the observed variables. However, these measurement errors often cannot be specified with distributional assumptions because nothing is known about them. Measurement errors with unknown structure appear in many different application areas and will be referred to as unstructured errors hereafter. For example, Rocke et al. (2009) state four sources of such errors in the case of microarray analysis. When using register data, typical sources of errors are over- and undercounts, which are usually not detectable from the register itself; see Burgard and Münnich (2012). Another example is the analysis of the potential use of big data in public health research in Khoury and Ioannidis (2014), where the authors strikingly state that “‘Big Error’ can plague Big Data”.

The field of robust optimization treats this problem of unstructured, uncertain data by robustifying optimization problems with uncertain parameters. In this setting, no additional information about these parameters is required besides that they belong to a prescribed uncertainty set. See Soyster (1973) for the first paper on robust optimization and the textbook by Ben-Tal et al. (2009) or the survey article by Bertsimas et al. (2011) for a general overview. For a recent application of these ideas to linear mixed models, see Burgard et al. (2020). However, the most basic variant of robust optimization, usually called strict robustness, has often been criticized since it may lead to overly conservative solutions, as these are explicitly hedged against the worst case. As a remedy, less conservative concepts of robustness have been proposed in the last decades; see, e.g., recoverable robustness (Liebchen et al. 2009), light robustness (Fischetti and Monaci 2009), adjustable robustness (Ben-Tal et al. 2004), or \(\Gamma \)-robustness (Bertsimas and Sim 2004).

As motivated above, there are several optimization problems in statistics that should be re-engineered to account for the unstructured error setting as it arises in big data sources. It is widely accepted that developing new robust optimization methodology for optimization problems from statistics and classification can lead to major advancements in the analysis and usability of large data sources. For example, a very important and widely used method is k-means clustering (MacQueen 1967; Lloyd 1982). It is used to find clusters in an unsupervised learning setting and is widely applied in the context of big data analysis; see, e.g., Grira et al. (2004) and Celebi and Aydin (2016). However, it does not compensate for possible errors in the data set. Therefore, the clusters that are found are perturbed by the unstructured errors and do not necessarily correspond to the true underlying structure of interest.

The marriage of problems from classification or statistics with modern techniques from mathematical optimization has been a field of very active research in recent years, and a comprehensive overview of this literature is beyond the scope of this article. Thus, we mainly focus on reviewing the literature at the interface of robust optimization, k-means clustering, and classification problems. One of the most prominent methods for classification is the support vector machine (SVM), which has also been considered from the point of view of robust optimization; see, e.g., Bhattacharyya et al. (2005), Trafalis and Gilbert (2007), Pant et al. (2011) and Bertsimas et al. (2019). The results obtained in these papers often illustrate that robustified solutions can give higher accuracy than nominal ones, leading to improvements in the SVM method. The most recent paper from the above list, Bertsimas et al. (2019), also applies robust optimization to two other major and widely used classification methods besides SVMs: logistic regression and CART. The presented extensive computational results show that the robust formulations outperform nominal and regularized models for most of the instances, especially for high-dimensional instances and instances that are hard to separate. Moreover, to the best of our knowledge, Bertsimas et al. (2019) is the only paper that also uses the concept of \(\Gamma \)-robustness for classification. Interestingly, the authors report that \(\Gamma \)-robustness is beneficial for robustifying SVMs. In contrast, in our experiments the \(\Gamma \)-robustified k-means clusterings are clearly outperformed by those of the strictly robustified models.

Our contribution is the following. We follow the path paved by Bertsimas et al. (2019) but consider strictly and \(\Gamma \)-robust optimization for the k-means clustering problem. The robustification of the underlying optimization problem allows us to account for erroneous input data. As a consequence, it becomes possible to compute clusterings that are closer to the clusterings of the original, i.e., error-free, data although only erroneous input data is available. To the best of our knowledge, this issue has not been studied before in the literature. Moreover, we develop tailored alternating direction methods (ADMs) to quickly obtain solutions of good quality. Here, one of the main insights is that an ADM for the nominal case can be generalized so that it becomes applicable to the robust counterparts while keeping its convergence properties.

Classic ADMs are extensions of augmented Lagrangian methods and were originally proposed in Gabay and Mercier (1976) and Glowinski and Marroco (1975). A general analysis of the convergence theory of ADMs is presented in Bertsekas and Tsitsiklis (1989). The particular case of ADMs applied to convex objective functions f(u, v) over disjoint constraint sets \(u \in U\), \(v \in V\), is discussed in Wendell and Hurter (1976). An extension to the case of a biconvex f is given in Gorski et al. (2007). More recently, in Boyd et al. (2011), the authors discuss applications of the alternating direction method of multipliers (ADMM), which is closely related to ADMs, in the broad area of machine learning and highlight its potential for parallelization and distributed computing. ADMs are very broadly applied; for an application of ADMs to solve nonconvex MINLPs in the context of gas transport, we refer to Geißler et al. (2015) and Geißler et al. (2018). For an application in supply chain management, we refer to Schewe et al. (2020). Furthermore, in Geißler et al. (2017) it is shown that idealized feasibility pumps can be seen as an ADM, and a penalty-based ADM for MIPs and MINLPs is proposed there. A similar method based on block coordinate descent and coordinate descent has been applied recently in the context of optimal imputation in Bertsimas et al. (2017). However, the authors do not consider clustering problems.

In general, these methods rely on a problem-specific decomposition of the variable space of the robust counterparts. Using these decompositions, solutions of good quality can be obtained by alternatingly solving two sub-problems that are significantly easier to solve than the robust counterpart. For the nominal case, the applied ADM exactly mimics the classic k-means clustering algorithm. However, our more abstract view on the decomposition of the problem later allows us to carry the idea of alternatingly solving easier subproblems over to the robustified settings. Strictly robust counterparts for the k-means clustering problem are also considered in Li et al. (2016) and Vo et al. (2016). However, these works do not investigate structural properties of data sets that indicate whether a robustification of k-means might be beneficial or not. Additionally, by explicitly framing our method as an ADM, we are also able to provide a convergence result (both for the nominal and the two robustified clustering methods), which is not done in Li et al. (2016) and Vo et al. (2016). We also extend our ADM setting such that partial minima (delivered by the ADM) of bad quality can be detected and improved, which allows the alternating algorithm to be restarted with a significantly improved starting point. We show in our numerical results that this restart heuristic leads to significantly better clustering results. In Li et al. (2016), the authors focus on clustering of incomplete data sets. Thus, in their computational study, the robust clustering results are evaluated by comparing them to other existing k-means algorithms for incomplete data sets. In contrast to this, the goal of this paper is to evaluate the quality of the obtained robustified solutions and to develop an understanding of structural properties of data sets for which robust clustering is beneficial. It turns out that a robustification of the clustering approach is not required if the underlying data points are already well separated. However, if this is not the case, the robustified method is able to recover the true clustering although only erroneous data is observed. Our analysis especially leads to an interesting result regarding the comparison of strictly and \(\Gamma \)-robust clustering. The \(\Gamma \)-robust approach was developed to overcome the criticism that strictly robust solutions are often too conservative in practice. For k-means clustering we observe the contrary: the conservatism of strict robustness leads to significantly better clustering results when compared to the \(\Gamma \)-robust solutions.

The remainder of the paper is structured as follows. In Sect. 2 we formally introduce the k-means clustering problem and state it as a mixed-integer nonlinear optimization problem (MINLP). Afterward, Sect. 3 contains the strictly robust counterpart, followed by the \(\Gamma \)-robust counterpart in Sect. 4. We state a generic ADM in Sect. 5 and discuss its basic convergence properties, which carry over when a tailored version is applied to the nominal or robustified clustering problems. By doing so, we can explicitly characterize the obtained clustering solutions as so-called partial minima of the underlying MINLP. Moreover, we state the specific sub-problems that need to be solved in every iteration of the ADM (if applied to the nominal or one of the two robust clustering methods) and show that they can always be solved efficiently. In Sect. 6, we describe the already mentioned restart heuristic and present our numerical results in Sect. 7. The paper closes with some concluding remarks and notes on future research topics in Sect. 8.

2 The k-means clustering problem and an MINLP formulation

Let \(X \in \mathbb {R}^{p \times n}\) be the matrix containing the data set for clustering, which consists of n data points in \(\mathbb {R}^p\). Here, the data point \(x^i \in \mathbb {R}^p\), \(i=1, \ldots , n\), corresponds to the ith column of X. For given k, the goal of k-means clustering is to find mean vectors \(\mu ^j \in \mathbb {R}^p\), \(j=1, \ldots , k\), of k clusters that satisfy

$$\begin{aligned} \mu ^* = \mathop {{\text {arg min}}}\limits _{\mu } \ h(X, \mu ), \quad \mu = (\mu ^j)_{j=1,\ldots ,k}, \end{aligned}$$
(1)

where the loss function h is a sum of distances such as the squared Euclidean distance

$$\begin{aligned} h(X,\mu ) = \sum _{j=1}^k \sum _{x^i \in S_j} \Vert x^i - \mu ^j\Vert _2^2, \end{aligned}$$
(2)

with \(S_j \subset \mathbb {R}^p\) being the set of data points that are assigned to cluster j.

By introducing binary variables \(b_{i,j} \in \{0,1\}\) for \(i = 1, \ldots , n\) and \(j=1, \ldots , k\), we can reformulate the function h as

$$\begin{aligned} h(X,\mu ,b) = \sum _{j=1}^{k} \sum _{i=1}^{n} b_{i,j} \Vert x^i - \mu ^j\Vert _2^2, \quad b=(b_{i,j})_{i=1,\ldots ,n}^{j=1,\ldots ,k}, \end{aligned}$$

where the binary variables have the meaning

$$\begin{aligned} b_{i,j} = {\left\{ \begin{array}{ll} 1, &{} \text {if point } x^i \text { is assigned to cluster } S_j, \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

As \(x^i \in \mathbb {R}^p\) should belong to exactly one cluster, we include the constraint

$$\begin{aligned} \sum _{j=1}^{k} b_{i,j} =1 \quad \text {for all} \quad i=1,\ldots ,n. \end{aligned}$$

To simplify the presentation in what follows, we introduce the notations

$$\begin{aligned} N \mathrel {{\mathop :}{=}}\{1, \ldots , n\}, \quad P \mathrel {{\mathop :}{=}}\{1, \ldots , p\}, \quad K \mathrel {{\mathop :}{=}}\{1, \ldots , k\}. \end{aligned}$$

Thus, a reformulation of the k-means clustering problem as a mathematical optimization problem is given by

$$\begin{aligned} \min _{\mu ,b} \quad&\sum _{j \in K} \sum _{i \in N} b_{i,j} \Vert x^{i} - \mu ^{j}\Vert _{2}^{2} \end{aligned}$$
(3a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\sum _{j \in K} b_{i,j} =1 \quad \text {for all} \quad i \in N, \end{aligned}$$
(3b)
$$\begin{aligned}&b_{i,j} \in \{0,1\} \quad \text {for all} \quad i \in N, \ j \in K, \end{aligned}$$
(3c)
$$\begin{aligned}&b \in \mathbb {R}^{n \times k}, \end{aligned}$$
(3d)
$$\begin{aligned}&\mu \in \mathbb {R}^{p \times k}. \end{aligned}$$
(3e)

This is a mixed-integer nonlinear programming problem (MINLP); all constraints are linear but the objective function is cubic.
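To make the objective function concrete, the following minimal NumPy sketch evaluates the loss of the MINLP above for a given assignment; the function name and the column-wise array layout (data points as columns of X, centers as columns of \(\mu \)) are our own illustrative choices and not part of the model.

```python
import numpy as np

def kmeans_loss(X, mu, b):
    """Evaluate the objective (3a) of the k-means MINLP.

    X  : (p, n) array, data points as columns.
    mu : (p, k) array, cluster centers as columns.
    b  : (n, k) binary array with b[i, j] = 1 iff point i is assigned to cluster j.
    """
    # Squared Euclidean distances between all points and all centers: (n, k).
    dist_sq = ((X[:, :, None] - mu[:, None, :]) ** 2).sum(axis=0)
    # Only assigned (point, cluster) pairs contribute to the loss.
    return float((b * dist_sq).sum())
```

For an assignment b that satisfies (3b) and (3c), this value coincides with the loss (2).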

3 The strict robust counterpart of the k-means clustering MINLP

Up to now, we assumed that the data points in X are known exactly. However, in practice, often only erroneous measurements \({\tilde{X}} = X + E\) can be observed instead, where \(E \in \mathbb {R}^{p \times n}\) describes the respective perturbation of the original data. We assume that there is neither information on the expected value nor on any higher order moment of the additive perturbation E. The error could even be a deterministic error as opposed to a stochastic one.

In statistics, measurement error correction methods typically assume that the errors are random variables that follow a known distribution; see Carroll et al. (2006). For a recent proposal on dealing with clustering in these classic measurement error settings, see Su et al. (2018). By accounting for the error distribution in the estimation process, the influence of the errors is reduced. However, this assumption is often very restrictive. Especially when using big data sources as in Davalos (2017) and Yamada et al. (2018), no meaningful error distribution can be assumed. As shown in White (2011) for biomarker measures, errors can be due to specimen collection and storage. Last but not least, in social science and econometric analysis, the indicators used are typically estimates resulting from a long statistical production process that is prone to different kinds of errors; see, e.g., Alfons et al. (2013). All these different types of errors can even be systematic and are thus far from being described by a zero-mean error distribution.

In this paper we tackle the following situation: Although we can only observe erroneous measures \({\tilde{X}}\) in practice, we want to compute a clustering that is as close as possible to the clustering of the original data \(X\in \mathbb {R}^{p \times n}\). Obviously, especially if E has a mean vector different from \(0\in \mathbb {R}^p\) over all observations in every cluster for every variable, the optimal solution

$$\begin{aligned} {\tilde{\mu }}^* = \mathop {{\text {arg min}}}\limits _{\mu } \ h({\tilde{X}}, \mu ), \quad \mu = (\mu ^j)_{j \in K}, \end{aligned}$$

will differ from the optimal solution of Problem (1). Therefore, the optimization problem has to be reformulated to account for the unobservable and unstructured error E. This can be done by using the min-max problem

$$\begin{aligned} \min _{\mu } \ \max _{D \in {\mathcal {U}}} \ h({\tilde{X}} - D, \mu ), \end{aligned}$$

with \({\mathcal {U}}\subseteq \mathbb {R}^{p \times n}\) being the uncertainty set, which will be specified right below. In particular, we assume that \({\mathcal {U}}\) can be chosen so that \(E\in {\mathcal {U}}\) holds. Thus, given an uncertainty set \({\mathcal {U}}\), we minimize the function h considering the worst-case scenario w.r.t. all possible \(D \in {\mathcal {U}}\).

As noted above, we assume that the error E is unstructured. Thus, we particularly refrain from using distributional information. Instead, we consider the box-uncertainty set \({\mathcal {U}}\subseteq \mathbb {R}^{p\times n}\) to be given as

$$\begin{aligned} {\mathcal {U}}^{\text {box}} \mathrel {{\mathop :}{=}}\{D \in \mathbb {R}^{p\times n}:-{\Delta d}_{l}^i \le d_l^i \le \Delta d_{l}^i, \ \Delta d^{i}_{l} \ge 0, \ (l,i) \in P\times N\}. \end{aligned}$$
(4)

With this at hand, we can now state the strictly robust counterpart of Problem (3):

$$\begin{aligned} \min _{\mu ,b} \ \max _{D \in {\mathcal {U}}^{\text {box}}} \quad&\sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - d^i - \mu ^j\Vert _2^2 \end{aligned}$$
(5a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\sum _{j\in K} b_{i,j} =1, \quad i\in N, \end{aligned}$$
(5b)
$$\begin{aligned}&b_{i,j} \in \left\{ 0,1\right\} , \quad i\in N, \ j\in K, \end{aligned}$$
(5c)
$$\begin{aligned}&\mu \in \mathbb {R}^{p \times k}. \end{aligned}$$
(5d)

This robust counterpart is an optimization problem that is not directly tractable as it is stated in (5). The main reason for its hardness is the min-max structure of the objective function. Fortunately, we can reformulate this problem without the inner maximization problem, as shown in the following theorem.

Theorem 3.1

The robust counterpart (5) is equivalent to

$$\begin{aligned} \min _{\mu ,b,\alpha } \quad&\sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Vert \Delta d^i\Vert _2^2 + \sum _{l\in P} 2 \, {\Delta d}^i_l \, \alpha _{i,j,l}\right) \end{aligned}$$
(6a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&-\alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu _l^j \le \alpha _{i,j,l}, \quad i\in N,\ j\in K,\ l\in P, \end{aligned}$$
(6b)
$$\begin{aligned}&\sum _{j\in K} b_{i,j} =1, \quad i \in N, \end{aligned}$$
(6c)
$$\begin{aligned}&b_{i,j} \in \{0,1\}, \quad i \in N, \ j \in K, \end{aligned}$$
(6d)
$$\begin{aligned}&\mu \in \mathbb {R}^{p \times k}. \end{aligned}$$
(6e)

Proof

First, we move the uncertainty from the objective function to the constraints by introducing an extra variable \(t\in \mathbb {R}\) and by rewriting the robust counterpart (5) equivalently as

$$\begin{aligned} \min _{\mu ,b,t} \quad&t \end{aligned}$$
(7a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\max _{D \in {\mathcal {U}}^{\text {box}}} \left\{ \sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - d^i - \mu ^j\Vert _2^2\right\} \le t, \end{aligned}$$
(7b)
$$\begin{aligned}&\sum _{j\in K} b_{i,j} =1, \quad i\in N, \end{aligned}$$
(7c)
$$\begin{aligned}&b_{i,j} \in \{0,1\}, \quad i\in N, \ j\in K, \end{aligned}$$
(7d)
$$\begin{aligned}&\mu \in \mathbb {R}^{p \times k}. \end{aligned}$$
(7e)

Now, the uncertainties only appear in Constraint (7b), which in turn is equivalent to

$$\begin{aligned}&\max _{D \in {\mathcal {U}}^{\text {box}}} \left\{ \sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Vert d^i\Vert _2^2 - 2({\tilde{x}}^i - \mu ^j)^{\top }d^{i}\right) \right\} \le t \\ \iff&\sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \max _{D \in {\mathcal {U}}^{\text {box}}} \left\{ \sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert d^i\Vert _2^2 - 2({\tilde{x}}^i - \mu ^j)^{\top }d^{i}\right) \right\} \le t. \end{aligned}$$

Using the specific box-structure of the uncertainty set (4), we obtain

$$\begin{aligned}&\max _{D \in {\mathcal {U}}^{\text {box}}} \sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert d^i\Vert _2^2 - 2({\tilde{x}}^i - \mu ^j)^{\top }d^{i}\right) \\&\quad = \max _{-{\Delta d}^{i}_{l} \le d^i_l \le {\Delta d}^{i}_{l}} \ \sum _{j\in K} \sum _{i\in N} \sum _{l\in P} b_{i,j} \left( (d^i_l)^2 - 2({\tilde{x}}^i_l - \mu ^j_l)d^{i}_l\right) \\&\quad = \sum _{j\in K} \sum _{i\in N} \sum _{l\in P} b_{i,j} \left( ({\Delta d}^i_l)^2 + 2 |{\tilde{x}}^i_l - \mu ^j_l| {\Delta d}^{i}_l \right) . \end{aligned}$$

Thus, we get the constraint

$$\begin{aligned} \sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \sum _{l\in P} ({\Delta d}^i_l)^2 + 2{\Delta d}^{i}_l |{\tilde{x}}^i_l - \mu ^j_l|\right) \le t, \end{aligned}$$

instead of (7b), which, in turn, can be re-written as

$$\begin{aligned}&\sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Vert {\Delta d}^i\Vert ^2_2 + \sum _{l\in P} 2{\Delta d}^i_l\alpha _{i,j,l}\right) \le t, \\&\qquad \qquad - \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N,\ j \in K,\ l \in P,\\&\qquad \qquad \qquad \qquad \qquad \qquad \quad \,\,\alpha _{i,j,l} \ge 0, \quad i \in N,\ j \in K,\ l \in P. \end{aligned}$$

Thus, Problem (5) is equivalent to

$$\begin{aligned} \min _{\mu ,b,t, \alpha } \quad&t \\ {{\,\mathrm{s.t.}\,}}\quad&\sum _{j\in K} \sum _{i\in N} b_{i,j} \left( \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Vert {\Delta d}^i\Vert ^2_2 + \sum _{l\in P} 2{\Delta d}^{i}_l\alpha _{i,j,l}\right) \le t, \\&- \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N,\ j \in K,\ l \in P, \\&\alpha _{i,j,l} \ge 0, \quad i \in N,\ j \in K,\ l \in P, \\&\sum _{j\in K} b_{i,j} =1, \quad i\in N, \\&b_{i,j} \in \{0,1\}, \quad i\in N, \ j\in K, \\&\mu \in \mathbb {R}^{p \times k}, \end{aligned}$$

and the result follows by eliminating t again. \(\square \)
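For fixed assignments b and centers \(\mu \), the worst-case objective (6a) can be evaluated directly, since \(\alpha _{i,j,l} = |{\tilde{x}}^i_l - \mu ^j_l|\) is an optimal choice of the auxiliary variables. The following NumPy sketch illustrates this evaluation; the function name and the array layout are hypothetical.

```python
import numpy as np

def strictly_robust_loss(X_tilde, mu, b, Delta):
    """Evaluate the strictly robust objective (6a) for fixed b and mu,
    using alpha[i, j, l] = |x_tilde[l, i] - mu[l, j]|.

    X_tilde : (p, n) observed (erroneous) data, points as columns.
    mu      : (p, k) cluster centers as columns.
    b       : (n, k) binary assignment matrix.
    Delta   : (p, n) componentwise error bounds (the Delta d's).
    """
    diff = X_tilde[:, :, None] - mu[:, None, :]                    # (p, n, k)
    dist_sq = (diff ** 2).sum(axis=0)                              # (n, k)
    delta_sq = (Delta ** 2).sum(axis=0)[:, None]                   # ||Delta d^i||^2, (n, 1)
    cross = 2.0 * (Delta[:, :, None] * np.abs(diff)).sum(axis=0)   # (n, k)
    return float((b * (dist_sq + delta_sq + cross)).sum())
```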

A standard criticism regarding strictly robust solutions (like those of Problem (6)) is that they tend to be too conservative as they are hedged against the worst-case error that can appear for all entries of the matrix \({\tilde{X}}\). To obtain robust solutions that are practically meaningful but less conservative, the authors of Bertsimas and Sim (2004) propose the so-called \(\Gamma \)-robust approach, which allows one to control the degree of conservatism. Thus, in the next section, we apply this approach to the k-means clustering problem as well.

4 The \(\Gamma \)-robust counterpart of the k-means clustering MINLP

The \(\Gamma \)-approach proposed in Bertsimas and Sim (2004) does not assume that all parameters realize in a worst-case way but restricts the number of such parameters by \(\Gamma \in {\mathbb {N}}\), which gives the approach its name. We now apply the technique proposed in Bertsimas and Sim (2004) to the k-means clustering problem. This means that some observations \({\tilde{x}}^i_l\) are not perturbed and, in this case, \({\tilde{x}}^i_l = x^i_l\) and \(d^i_l = 0\) holds. Thus, the matrices \({\tilde{X}}\) and X differ in at most \(\Gamma \) many elements.

To state the \(\Gamma \)-robust counterpart of Problem (3), we consider the uncertainty set

$$\begin{aligned} {\mathcal {U}}_{\Gamma } \mathrel {{\mathop :}{=}}\left\{ D \in {\mathcal {U}}^{\text {box}}:\left|\left\{ d^i_l:d^i_l \ne 0, \ (l,i) \in P\times N\right\} \right| \le \Gamma \right\} . \end{aligned}$$

Using this uncertainty set, we prescribe the maximum number of observations that may realize in a worst-case way. However, we do not state in advance which of them will do so. Hence, the \(\Gamma \)-robust counterpart of Problem (3) is given by

$$\begin{aligned} \min _{\mu , b, \alpha } \quad&\sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i -\mu ^j\Vert ^2_2 \nonumber \\&\quad + \max _{\{I\subseteq P\times N:|I| \le \Gamma \}} \left\{ \sum _{j\in K} \sum _{(l,i)\in I} b_{i,j}\left( (\Delta d^i_l)^2 + 2\Delta d^i_l\alpha _{i,j,l}\right) \right\} \end{aligned}$$
(8a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&- \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N,\ j \in K,\ l \in P, \end{aligned}$$
(8b)
$$\begin{aligned}&\alpha _{i,j,l} \ge 0, \quad i \in N,\ j \in K,\ l \in P, \end{aligned}$$
(8c)
$$\begin{aligned}&\sum _{j\in K} b_{i,j} =1, \quad i\in N, \end{aligned}$$
(8d)
$$\begin{aligned}&b_{i,j} \in \{0,1\}, \quad i\in N,\ j\in K, \end{aligned}$$
(8e)
$$\begin{aligned}&\mu \in \mathbb {R}^{p \times k}, \end{aligned}$$
(8f)

where we already replaced the absolute values with additional variables and linear constraints as in Theorem 3.1. This model can be reformulated as an equivalent problem without the inner maximization problem as we show next.

Theorem 4.1

The \(\Gamma \)-robust counterpart (8) is equivalent to

$$\begin{aligned} \min _{\mu ,b, \alpha , \uplambda , \beta } \quad&\sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Gamma \uplambda + \sum _{(l,i)\in P\times N}\beta ^i_l \end{aligned}$$
(9a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\uplambda + \beta ^i_l \ge \sum _{j\in K} b_{i,j} \left( (\Delta d^i_l)^2 + 2\Delta d^i_l\alpha _{i,j,l}\right) , \quad i \in N, \ l \in P, \end{aligned}$$
(9b)
$$\begin{aligned}&- \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N,\ j \in K, \ l \in P, \end{aligned}$$
(9c)
$$\begin{aligned}&\alpha _{i,j,l} \ge 0, \quad i \in N,\ j \in K,\ l \in P, \end{aligned}$$
(9d)
$$\begin{aligned}&\sum _{j\in K} b_{i,j} =1, \quad i\in N, \end{aligned}$$
(9e)
$$\begin{aligned}&b_{i,j} \in \{0,1\}, \quad i\in N, \ j\in K, \end{aligned}$$
(9f)
$$\begin{aligned}&\uplambda \ge 0, \end{aligned}$$
(9g)
$$\begin{aligned}&\beta ^i_l \ge 0, \quad i \in N, \ l \in P, \end{aligned}$$
(9h)
$$\begin{aligned}&\mu \in \mathbb {R}^{p \times k}. \end{aligned}$$
(9i)

Proof

First, we rewrite Problem (8) as

$$\begin{aligned}&\min _{\mu , b, \alpha , t} \qquad t \end{aligned}$$
(10a)
$$\begin{aligned}&\qquad {{\,\mathrm{s.t.}\,}}\quad \sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i -\mu ^j\Vert ^2_2 \nonumber \\&\qquad \qquad \quad + \max _{\{I\subseteq P\times N:|I| \le \Gamma \}} \left\{ \sum _{j\in K} \sum _{(l,i)\in I} b_{i,j}\left( (\Delta d^i_l)^2 + 2\Delta d^i_l\alpha _{i,j,l}\right) \right\} \le t, \end{aligned}$$
(10b)
$$\begin{aligned}&\qquad \qquad \quad - \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N, \ j \in K, \ l \in P, \end{aligned}$$
(10c)
$$\begin{aligned}&\qquad \qquad \qquad \alpha _{i,j,l} \ge 0, \quad i \in N,\ j \in K,\ l \in P, \end{aligned}$$
(10d)
$$\begin{aligned}&\qquad \qquad \quad \sum _{j\in K} b_{i,j} =1, \quad i\in N, \end{aligned}$$
(10e)
$$\begin{aligned}&\qquad \qquad \quad \, b_{i,j} \in \{0,1\}, \quad i\in N, \ j\in K, \end{aligned}$$
(10f)
$$\begin{aligned}&\qquad \qquad \quad \, \mu \in \mathbb {R}^{p \times k}. \end{aligned}$$
(10g)

Note that the inner maximization problem that appears in Constraint (10b) can be re-written by re-arranging the terms in the sums. By doing so, we obtain the following equivalent subset selection problem

$$\begin{aligned} \max _{\{I\subseteq P\times N:|I| \le \Gamma \}} \left\{ \sum _{(l,i)\in I} \sum _{j\in K} b_{i,j}\left( (\Delta d^i_l)^2 + 2 \, \Delta d^i_l \, \alpha _{i,j,l}\right) \right\} . \end{aligned}$$
(11)

We now reformulate the latter problem as a linear optimization problem. For details, we refer to Proposition 1 in Bertsimas and Sim (2004). From the cited result, it follows that an equivalent formulation of Problem (11) is given by

$$\begin{aligned} \max _{z\in \mathbb {R}^{p\times n}} \quad&\sum _{(l,i)\in P\times N} \left( \sum _{j\in K} b_{i,j} \left( (\Delta d^i_{l})^2 + 2 \, \Delta d^i_l \, \alpha _{i,j,l}\right) \right) z_l^i \end{aligned}$$
(12a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\sum _{(l,i)\in P\times N} z^i_l \le \Gamma , \end{aligned}$$
(12b)
$$\begin{aligned}&0\le z_l^i \le 1, \quad i \in N, \ l\in P . \end{aligned}$$
(12c)

This is a linear optimization problem in z and its dual problem reads

$$\begin{aligned} \min _{\uplambda , \beta } \quad&\Gamma \uplambda + \sum _{(l,i)\in P\times N} \beta ^i_l \end{aligned}$$
(13a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\uplambda + \beta ^i_l \ge \sum _{j\in K} b_{i,j} \left( (\Delta d^i_{l})^2 + 2 \, \Delta d^i_l \, \alpha _{i,j,l} \right) , \quad i\in N, \ l\in P, \end{aligned}$$
(13b)
$$\begin{aligned}&\uplambda \ge 0, \end{aligned}$$
(13c)
$$\begin{aligned}&\beta ^i_l \ge 0, \quad i\in N, \ l\in P. \end{aligned}$$
(13d)

Here, \(\uplambda \) is the dual variable corresponding to Constraint (12b) and \(\beta ^i_l\) are the dual variables corresponding to the constraints in (12c). Since Problem (12) is feasible and bounded, we can apply strong duality, which states that the dual problem (13) is also feasible and bounded and that the primal and dual optimal objective function values coincide. We can thus replace the inner maximization problem in (10b) by its dual minimization problem. In addition, notice that we do not need the minimum here because if

$$\begin{aligned} t&\ge \sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Gamma \uplambda + \sum _{(l,i)\in P\times N}\beta ^i_l, \\ \uplambda + \beta ^i_l&\ge \sum _{j\in K}b_{i,j} \left( (\Delta d^i_{l})^2 + 2\Delta d^i_l\alpha _{i,j,l}\right) , \quad i\in N, \ l\in P, \\ \uplambda&\ge 0, \\ \beta ^i_l&\ge 0, \quad i\in N, \ l\in P, \end{aligned}$$

is true, it also holds for the minimum value of

$$\begin{aligned} \Gamma \uplambda + \sum _{(l,i)\in P\times N}\beta ^i_l \end{aligned}$$

over the dual feasible set. Hence, we obtain

$$\begin{aligned} \min _{\mu , b, t, \uplambda , \beta } \quad&t \\ {{\,\mathrm{s.t.}\,}}\quad&t \ge \sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Gamma \uplambda + \sum _{(l,i)\in P\times N}\beta ^i_l, \\&\uplambda + \beta ^i_l \ge \sum _{j\in K} b_{i,j} \left( (\Delta d^i_{l})^2 + 2\Delta d^i_l\alpha _{i,j,l}\right) , \quad i\in N, \ l\in P, \\&- \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N, \ j \in K, \ l \in P, \\&\alpha _{i,j,l} \ge 0, \quad i \in N,\ j \in K,\ l \in P, \\&\sum _{j\in K} b_{i,j}=1, \quad i\in N, \\&b_{i,j} \in \{0,1\}, \quad i\in N, \ j\in K, \\&\uplambda \ge 0, \\&\beta ^i_l \ge 0, \quad i\in N, \ l\in P, \end{aligned}$$

and the result follows by eliminating t. \(\square \)
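For fixed b, \(\mu \), and \(\alpha \), the inner maximization in (8a), i.e., the subset-selection problem (11), simply picks the \(\Gamma \) largest contributions among all pairs (l, i), since all contributions are non-negative. The following NumPy sketch of this evaluation uses hypothetical names and the same array layout as before.

```python
import numpy as np

def gamma_inner_max(b, alpha, Delta, Gamma):
    """Value of the inner maximization in (8a)/(11) for fixed b, alpha, Delta.

    b     : (n, k) binary assignment matrix.
    alpha : (n, k, p) array with alpha[i, j, l] >= |x_tilde[l, i] - mu[l, j]|.
    Delta : (p, n) componentwise error bounds.
    Gamma : number of entries that may realize in a worst-case way.
    """
    n, k, p = alpha.shape
    # Contribution of entry (l, i) if it deviates:
    # sum_j b[i, j] * (Delta[l, i]**2 + 2 * Delta[l, i] * alpha[i, j, l]).
    contrib = np.array([
        [sum(b[i, j] * (Delta[l, i] ** 2 + 2.0 * Delta[l, i] * alpha[i, j, l])
             for j in range(k)) for i in range(n)]
        for l in range(p)
    ])
    # All contributions are non-negative, so the worst case picks the Gamma largest.
    return float(np.sort(contrib.ravel())[::-1][:Gamma].sum())
```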

5 Alternating direction methods for nominal and robust k-means clustering

In this section, we propose tailored alternating direction methods (ADMs) to compute approximate solutions to Problems (3), (6), and (9). Note that all these problems are MINLPs, which are in general NP-hard to solve to global optimality. Actually, the nominal k-means problem itself is NP-hard even in the particular case of two dimensions; see, e.g., Dasgupta (2007), Aloise et al. (2009) and Mahajan et al. (2012). For such hard problems it is usually considered appropriate to study primal heuristics to quickly compute feasible points of good quality. The presented ADMs can be seen as such primal heuristics with the additional advantage that they come with a formal convergence analysis. The key idea of the ADMs is to decompose the set of variables into two blocks: the set of binary variables and the set of continuous variables. Afterward, we alternatingly solve the problem in one direction while keeping the other one fixed, which is a popular approach for clustering problems; see, e.g., Li et al. (2016) or Ames (2014), Pirinen and Ames (2019), where an alternating direction method of multipliers is used to solve semidefinite programming relaxations for computing a clustering of weighted graphs.

In the following, we first briefly review classic ADMs and afterward propose tailored versions of ADMs to solve Problems (3), (6), and (9).

5.1 General ADM framework and convergence properties

We now consider the general problem

$$\begin{aligned} \min _{u,v} \quad&f(u,v) \end{aligned}$$
(15a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&g(u,v) = 0, \quad c(u,v) \ge 0, \end{aligned}$$
(15b)
$$\begin{aligned} \quad&u \in U \subseteq \mathbb {R}^{n_u}, \quad v\in V \subseteq \mathbb {R}^{n_v}, \end{aligned}$$
(15c)

for which we make the following assumption.

Assumption 1

The objective function \(f:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}\) and the constraint functions \(g:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}^m\), \(c:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}^q\) are continuous and the sets U and V are non-empty and compact.

The feasible set of Problem (15) is denoted by \(\Omega \), i.e.,

$$\begin{aligned} \Omega = \left\{ (u,v) \in U\times V:g(u, v) = 0, \ c(u,v)\ge 0\right\} \subseteq U\times V. \end{aligned}$$

Alternating direction methods are iterative procedures that solve Problem (15) by alternatingly solving two simpler problems. Given an iterate \((u^t,v^t)\), they solve Problem (15) with v fixed to \(v^t\) into the direction of u, yielding a new u-iterate \(u^{t+1}\). Afterward, u is fixed to \(u^{t+1}\) and Problem (15) is solved into the direction of v, yielding a new v-iterate \(v^{t+1}\). The algorithm is formally stated in Algorithm 1. The for-loop is repeated until a termination criterion is reached. In order to state convergence results for Algorithm 1, we need the following definition.

[Algorithm 1: Generic alternating direction method (ADM)]
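A minimal Python sketch of the generic ADM loop reads as follows. The subproblem solvers solve_u and solve_v as well as the tolerance and the iteration limit are placeholders; the termination criterion (the iterates no longer change) is in the spirit of the one we use in Sect. 7.1.

```python
import numpy as np

def generic_adm(u0, v0, solve_u, solve_v, tol=1e-4, max_iter=100):
    """Generic ADM in the spirit of Algorithm 1 (sketch).

    solve_u(v) returns a minimizer of f(., v) over the feasible u-set,
    solve_v(u) returns a minimizer of f(u, .) over the feasible v-set.
    """
    u, v = np.asarray(u0, dtype=float), np.asarray(v0, dtype=float)
    for _ in range(max_iter):
        u_new = np.asarray(solve_u(v), dtype=float)      # first subproblem
        v_new = np.asarray(solve_v(u_new), dtype=float)  # second subproblem
        # Terminate as soon as both iterates no longer change (up to tol).
        if np.max(np.abs(u_new - u)) < tol and np.max(np.abs(v_new - v)) < tol:
            return u_new, v_new
        u, v = u_new, v_new
    return u, v
```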

Definition 5.1

Let \((u^*,v^*) \in \Omega \) be a feasible point of Problem (15). Then, \((u^*,v^*)\) is called a partial minimum of Problem (15) if

$$\begin{aligned}&f(u^*,v^*) \le f(u,v^*) \quad \text {for all} \quad (u, v^*) \in \Omega , \\&f(u^*,v^*) \le f(u^*,v) \quad \text {for all} \quad (u^*, v)\in \Omega \end{aligned}$$

holds.

Let \(\Theta ({\bar{u}},{\bar{v}})\) be the set of possible iterates starting from the point \(({\bar{u}},{\bar{v}})\), i.e.,

$$\begin{aligned} \Theta ({\bar{u}},{\bar{v}}) = \left\{ (u^*,v^*):f(u^*,{\bar{v}}) \le f(u,{\bar{v}}), \, u\in U; \ f(u^*,v^*) \le f(u^*,v), \, v\in V \right\} . \end{aligned}$$

The following general convergence result is taken from Gorski et al. (2007).

Theorem 5.1

Let \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) be a sequence generated by Algorithm 1 with \((u^{t+1},v^{t+1}) \in \Theta (u^t,v^t)\). Suppose that the solution of the first optimization problem (in Line 3) is always unique. Then, every convergent subsequence of \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) converges to a partial minimum. In addition, if w and \(w'\) are two limit points of such subsequences, then \(f(w) = f(w')\) holds.

The sequence \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) generated by Algorithm 1 may not converge, but instead may have several convergent subsequences. In particular, if \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) is contained in a compact set, then there exists at least one convergent subsequence \(\{ (u^t,v^t)\}_{t\in {\mathcal {T}}}\). A partial minimum need not be a global minimum. However, as the problems that we need to solve are NP-hard, we are willing to accept sub-optimal solutions. For more details on the convergence theory of classic ADMs, see, e.g., Gorski et al. (2007) and Wendell and Hurter (1976).

5.2 An ADM for the nominal k-means clustering problem

In this section, we propose a tailored version of Algorithm 1 to compute partial minima of Problem (3). First, we define the sets U and V. For each attribute index \(l\in P\), let \(\pi _{l}^{-} \mathrel {{\mathop :}{=}}\min \{x_l^i:i\in N\}\) and \(\pi _{l}^{+} \mathrel {{\mathop :}{=}}\max \{x_l^i:i\in N\}\) be the minimum and maximum value of this attribute, respectively, and let \(\Pi (X) \mathrel {{\mathop :}{=}}[\pi _{1}^{-}, \pi _{1}^{+}] \times \cdots \times [\pi _{p}^{-}, \pi _{p}^{+}]\) be the bounding box of all data points. Thus, \(x^{i} \in \Pi (X)\) holds for all \(i\in N\).

Now, within the context of Algorithm 1, we consider the following non-empty and compact sets

$$\begin{aligned} U&\mathrel {{\mathop :}{=}}\left\{ b \in \mathbb {R}^{n \times k}:b_{i,j} \in \{0,1\}, \ \sum _{j\in K} b_{i,j} = 1, \ i\in N,\ j\in K\right\} , \end{aligned}$$
(16)
$$\begin{aligned} V&\mathrel {{\mathop :}{=}}\left\{ \mu \in \mathbb {R}^{p \times k}:\mu ^j \in \Pi (X), \ j\in K\right\} , \end{aligned}$$
(17)

which gives the variable splitting required for the application of the ADM. Note that, in Problem (3), we do not have any coupling constraint such as g or c in the general Problem (15). In what follows, we describe how we compute a partial minimum of Problem (3).

Suppose that initial mean vectors \(\mu ^0\in V\) are given. In each iteration t, the problem in the direction of U reads

$$\begin{aligned} b^{t+1} \in \mathop {{\text {arg min}}}\limits _{b\in U} \ \sum _{j\in K}\sum _{i\in N} b_{i,j} \Vert x^i - (\mu ^j)^t\Vert _2^2. \end{aligned}$$
(18)

The optimal solution to this binary problem can be obtained as follows. Consider a fixed \(i\in N\). Since \(b_{i,j} \in \{0,1\}\) for \(j\in K\) and only one \(b_{i,j}\) is equal to 1, we set \(b_{i,j} = 1\) for the j that minimizes the norm. This is exactly the j for which \(\mu ^j\) is the center closest to the point \(x^i\). Thus, for each \(i\in N\), we compute

$$\begin{aligned} J_i = \left\{ j:j \in \mathop {{\text {arg min}}}\limits _{j\in K} \, \Vert x^i - (\mu ^j)^t\Vert _2^2\right\} . \end{aligned}$$
(19)

Note that there may exist more than one element in \(J_i\). In this case, we break the tie by choosing the smallest one. With \(b^{t+1}\) at hand, we update the mean vectors by solving the problem in the direction of V, which is given by

$$\begin{aligned} \mu ^{t+1} = \mathop {{\text {arg min}}}\limits _{\mu \in V} \ \sum _{j\in K}\sum _{i\in N} b_{i,j}^{t+1} \Vert x^i - \mu ^j\Vert _2^2. \end{aligned}$$

This problem can be solved very effectively by using the formula

$$\begin{aligned} (\mu ^j)^{t+1} = \frac{1}{|C_j^{t+1}|} \sum _{i\in C_j^{t+1}} x^i, \quad j\in K, \end{aligned}$$
(20)

with

$$\begin{aligned} C_j^{t+1} = \{i\in N:b_{i,j}^{t+1} =1\}. \end{aligned}$$

In other words, in each iteration, we obtain the new centers \(\mu ^{t+1}\in V\) by computing the centers for the given assignment \(b^{t+1}\). Now, we can summarize the main steps of our algorithm to compute a partial minimum of Problem (3) in Algorithm 2. For details on how we choose the initial centers we refer to Sect. 7.3.

[Algorithm 2: ADM for the nominal k-means clustering problem (3)]
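A compact NumPy sketch of the two subproblem solutions (18)–(20) and the resulting ADM loop, which essentially reproduces Lloyd's algorithm, is given below. Function names are ours; ties in (19) are broken towards the smallest index as described above, and keeping the old center when a cluster becomes empty is an implementation choice not discussed in the text.

```python
import numpy as np

def assign_points(X, mu):
    """U-subproblem (18): assign every point to its closest center; np.argmin
    breaks ties by returning the smallest index, cf. (19)."""
    dist_sq = ((X[:, :, None] - mu[:, None, :]) ** 2).sum(axis=0)  # (n, k)
    n, k = dist_sq.shape
    b = np.zeros((n, k), dtype=int)
    b[np.arange(n), np.argmin(dist_sq, axis=1)] = 1
    return b

def update_centers(X, b, mu_old):
    """V-subproblem: closed-form center update (20)."""
    mu = mu_old.copy()
    for j in range(b.shape[1]):
        members = b[:, j] == 1
        if members.any():              # keep the old center if C_j is empty
            mu[:, j] = X[:, members].mean(axis=1)
    return mu

def adm_kmeans(X, mu0, tol=1e-4, max_iter=1000):
    """ADM for the nominal problem (3); mimics classic k-means."""
    mu = mu0.copy()
    for _ in range(max_iter):
        b = assign_points(X, mu)
        mu_new = update_centers(X, b, mu)
        if np.max(np.abs(mu_new - mu)) < tol:   # termination as in Sect. 7.1
            return mu_new, b
        mu = mu_new
    return mu, b
```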

Since the method in Algorithm 2 is a special case of the general ADM in Algorithm 1, the general convergence result as stated in Theorem 5.1 can be applied, yielding that Algorithm 2 leads to partial minima. Let us finally mention that the presented ADM exactly mimics the classic k-means clustering algorithm. However, this more abstract view on decomposing the problem can be nicely transferred to the robustified settings that we discuss in the following sections.

5.3 An ADM for the strictly robust k-means clustering problem

In this section, we propose a tailored ADM to compute partial minima of Problem (6). First, we describe how we choose the set V. Consider the bounding box of the erroneous measures \(\Pi ({\tilde{X}})\) as in Sect. 5.2. Thus, \({\tilde{x}}^i\), \(\mu ^j\in \Pi ({\tilde{X}})\) holds for all \(i\in N\), \(j\in K\). With this, for all \(i\in N\), \(j\in K\), \(l\in P\), we have

$$\begin{aligned} |{\tilde{x}}^i_l - \mu ^j_l| \le \pi _{l}^{+} - \pi _{l}^{-} \mathrel {{=}{\mathop :}}M_{l}, \quad M \mathrel {{\mathop :}{=}}\max \{M_l:l\in P\}. \end{aligned}$$
(21)

In any optimal solution of Problem (6), \(\alpha \) is minimized while satisfying (6b). Thus, we can conclude that M as defined in (21) is a valid upper bound for \(\alpha \), i.e., we can require

$$\begin{aligned} |{\tilde{x}}^i_l - \mu ^j_l| \le \alpha _{i,j,l} \le M, \quad i\in N, \ j\in K, \ l\in P. \end{aligned}$$

This leads to the compact sets

$$\begin{aligned} V_{\alpha }&\mathrel {{\mathop :}{=}}\left\{ \alpha \in \mathbb {R}^{n\times k \times p}:0\le \alpha _{i,j,l}\le M, \ i\in N, \ j\in K,\ l\in P\right\} , \end{aligned}$$
(22)
$$\begin{aligned} V_{\mu }&\mathrel {{\mathop :}{=}}\left\{ \mu \in \mathbb {R}^{p\times k}:\mu ^j \in \Pi ({\tilde{X}}), \ j\in K\right\} . \end{aligned}$$
(23)

Finally, let \(V\mathrel {{\mathop :}{=}}V_{\mu }\times V_{\alpha }\) be the Cartesian product of the compact sets \(V_\mu \) and \(V_\alpha \). Thus, the variable splitting is given by the non-empty and compact set (16) and V as just defined. Note that, with this variable splitting, we again do not have any coupling constraints. Now, we can apply Algorithm 1 to Problem (6).

Suppose that the continuous variables \(\mu ^0\) and \(\alpha ^0\) are given. In each iteration t, the first subproblem to be solved is the binary optimization problem

$$\begin{aligned} b^{t+1} \in \mathop {{\text {arg min}}}\limits _{b\in U} \ \sum _{j\in K}\sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - (\mu ^j)^{t}\Vert _2^2, \end{aligned}$$

in which we already skipped all the constant terms in the objective function. This problem can be solved in analogy to Problem (18). Suppose that we have computed \(b^{t+1}\). The new iterates \(\mu ^{t+1}\) and \(\alpha ^{t+1}\) are computed by solving

$$\begin{aligned} \min _{(\mu ,\alpha ) \in V} \quad&\sum _{j\in K}\sum _{i\in N} b_{i,j}^{t+1} \left( \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Vert \Delta d^i\Vert _2^2 + \sum _{l\in P} 2 \, \Delta d^i_l \, \alpha _{i,j,l}\right) \end{aligned}$$
(24a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&-\alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu _{l}^j \le \alpha _{i,j,l}, \quad i\in N,\ j\in K,\ l\in P. \end{aligned}$$
(24b)

This is a convex quadratic optimization problem and can thus be efficiently solved by standard state-of-the-art solvers; see, e.g., Ben-Tal and Nemirovski (2001) and Boyd and Vandenberghe (2004) for an overview of algorithms and complexity results for convex problems. The overall method to compute a partial minimum of Problem (6) is formally given in Algorithm 3. As in the last section, Theorem 5.1 can be applied again, yielding that Algorithm 3 leads to partial minima of the strictly robust counterpart.

[Algorithm 3: ADM for the strictly robust k-means clustering problem (6)]
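The convex QP subproblem (24) can be handed to any QP solver. Since our implementation uses Gurobi via its Python interface (cf. Sect. 7.1), the following gurobipy sketch shows one possible way to set the subproblem up; all names are ours, and the simple bounds encode the sets \(V_\mu \) and \(V_\alpha \).

```python
import gurobipy as gp
from gurobipy import GRB

def solve_strict_v_step(X_tilde, Delta, b, box_lo, box_up, M):
    """Solve subproblem (24) for a fixed assignment matrix b.

    X_tilde, Delta : (p, n) observed data and error bounds, points as columns.
    b              : (n, k) binary assignment matrix from the previous U-step.
    box_lo, box_up : length-p bounds of the bounding box Pi(X_tilde).
    M              : scalar upper bound on the alpha variables, cf. (21).
    """
    p, n = X_tilde.shape
    k = b.shape[1]
    model = gp.Model("strictly-robust-V-step")
    model.Params.OutputFlag = 0

    mu = model.addVars(p, k, name="mu")
    for l in range(p):
        for j in range(k):
            mu[l, j].LB = float(box_lo[l])
            mu[l, j].UB = float(box_up[l])
    alpha = model.addVars(n, k, p, lb=0.0, ub=float(M), name="alpha")

    obj = gp.QuadExpr()
    for i in range(n):
        const_i = float((Delta[:, i] ** 2).sum())        # ||Delta d^i||_2^2
        for j in range(k):
            for l in range(p):
                xt = float(X_tilde[l, i])
                # Constraints (24b): -alpha <= x_tilde - mu <= alpha.
                model.addConstr(xt - mu[l, j] <= alpha[i, j, l])
                model.addConstr(mu[l, j] - xt <= alpha[i, j, l])
                if b[i, j] == 1:
                    diff = xt - mu[l, j]
                    obj += diff * diff + 2.0 * float(Delta[l, i]) * alpha[i, j, l]
            if b[i, j] == 1:
                obj += const_i                           # constant, kept for completeness
    model.setObjective(obj, GRB.MINIMIZE)
    model.optimize()
    return [[mu[l, j].X for j in range(k)] for l in range(p)]
```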

5.4 An ADM for the \(\Gamma \)-robust k-means clustering problem

As in Sect. 5.3, we now first discuss how to obtain the set V for applying Algorithm 1 to Problem (9).

Note that in Problem (9), the continuous variables \(\uplambda \) and \(\beta \) are always bounded in the direction of optimization since they are non-negative and minimized. Thus, there exist constants \(M_{\uplambda }\) and \(M_{\beta }\) with

$$\begin{aligned} 0\le \uplambda \le M_\uplambda , \quad 0\le \beta _{l}^i\le M_\beta , \quad i\in N, \ l\in P. \end{aligned}$$

Now, we consider the following non-empty and compact sets

$$\begin{aligned} V_\uplambda \mathrel {{\mathop :}{=}}\left\{ \uplambda \in \mathbb {R}:0\le \uplambda \le M_\uplambda \right\} , \quad V_\beta \mathrel {{\mathop :}{=}}\left\{ \beta \in \mathbb {R}^{p\times n}:0\le \beta _{l}^i\le M_\beta , \ i\in N, \ l\in P\right\} . \end{aligned}$$

With these sets and the sets in (22) and (23) at hand, we can define the non-empty and compact set V as

$$\begin{aligned} V\mathrel {{\mathop :}{=}}V_\mu \times V_\alpha \times V_\uplambda \times V_\beta . \end{aligned}$$
(25)

Thus, the variable splitting is represented by the sets (16) and (25). In this case, however, we have the coupling constraints (9b). In the notation of (15), these constraints are formally given by \(c:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}^{p\times n}\) with

$$\begin{aligned} c_{(l,i)}(b, \alpha , \uplambda , \beta ) = \uplambda + \beta ^i_l - \sum _{j\in K} b_{i,j} \left( (\Delta d^i_l)^2 + 2 \, \Delta d^i_l \, \alpha _{i,j,l}\right) , \quad (l,i) \in P\times N. \end{aligned}$$

In the following, we describe how we solve the ADM subproblems. To solve Problem (9) into the direction of U, we first need to choose \(\mu ^0\), \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\). Suppose that \(\mu ^0\) is given and that we obtain \(b^0\) as in Steps 3–6 of Algorithm 2 w.r.t. \({\tilde{x}}^i\). Then, we can compute \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\) by solving the linear optimization problem

$$\begin{aligned} \min _{(\alpha , \uplambda , \beta )\in V} \quad&\Gamma \uplambda + \sum _{(l,i)\in P\times N}\beta ^i_l \end{aligned}$$
(26a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&c_{(l,i)}(b^0, \alpha , \uplambda , \beta ) \ge 0, \quad (l,i) \in P\times N, \end{aligned}$$
(26b)
$$\begin{aligned}&- \alpha _{i,j,l} \le {\tilde{x}}^i_l - (\mu ^j_l)^0 \le \alpha _{i,j,l}, \quad i \in N,\ j \in K, \ l \in P. \end{aligned}$$
(26c)

With the initial values for \(\mu ^0\), \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\) at hand, the problem in the direction of U in iteration t reads

$$\begin{aligned} \min _{b \in U} \quad&\sum _{j\in K} \sum _{i\in N} b_{i,j} \Vert {\tilde{x}}^i - (\mu ^j)^t\Vert _2^2 \end{aligned}$$
(27a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&c_{(l,i)}(b, \alpha ^t, \uplambda ^t, \beta ^t) \ge 0, \quad (l,i) \in P\times N, \end{aligned}$$
(27b)

where we already skipped all constant terms in the objective function.

Note that for \(t=0\), we obtain \(b^1=b^0\) by solving Problem (27). This is because \(b^0\) is feasible for (27) and \(b^0\) leads to the optimal objective value since it has been obtained as in Algorithm 2 w.r.t. \({\tilde{x}}^i\). Thus, there is no need to compute \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\), and we can directly proceed with the problem into the direction of V using \(b^0\), as described below.

Suppose that \(\mu ^0\) is given and \(b^0\) is obtained as in Algorithm 2. Thus, in each iteration t, we compute \(\mu ^{t+1}\), \(\alpha ^{t+1}\), \(\uplambda ^{t+1}\), and \(\beta ^{t+1}\) by solving

$$\begin{aligned} \min _{(\mu ,\alpha , \uplambda , \beta )\in V} \quad&\sum _{j\in K} \sum _{i\in N} b_{i,j}^t \Vert {\tilde{x}}^i - \mu ^j\Vert _2^2 + \Gamma \uplambda + \sum _{(l,i)\in P\times N}\beta ^i_l \end{aligned}$$
(28a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&c_{(l,i)}(b^t, \alpha , \uplambda , \beta ) \ge 0, \quad (l,i) \in P\times N, \end{aligned}$$
(28b)
$$\begin{aligned}&- \alpha _{i,j,l} \le {\tilde{x}}^i_l - \mu ^j_l \le \alpha _{i,j,l}, \quad i \in N,\ j \in K, \ l \in P, \end{aligned}$$
(28c)

which is a convex quadratic optimization problem. Having solved this problem, we obtain the next b-iterate by solving Problem (27).

Algorithm 4 summarizes the main steps to compute a partial minimum of Problem (9). Again, Theorem 5.1 ensures that Algorithm 4 computes partial minima of the \(\Gamma \)-robust counterpart.

[Algorithm 4: ADM for the \(\Gamma \)-robust k-means clustering problem (9)]
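For completeness, the following gurobipy sketch sets up the binary subproblem (27) for a fixed V-iterate \((\mu ^t, \alpha ^t, \uplambda ^t, \beta ^t)\). Again, all names are ours; the coupling constraints from (9b) appear explicitly as (27b).

```python
import gurobipy as gp
from gurobipy import GRB

def solve_gamma_u_step(X_tilde, Delta, mu, alpha, lam, beta):
    """Solve the binary subproblem (27) of the Gamma-robust ADM.

    X_tilde, Delta, beta : (p, n) arrays;  mu : (p, k);  alpha : (n, k, p);
    lam : scalar value of the current lambda-iterate.
    """
    p, n = X_tilde.shape
    k = mu.shape[1]
    model = gp.Model("gamma-robust-U-step")
    model.Params.OutputFlag = 0

    b = model.addVars(n, k, vtype=GRB.BINARY, name="b")
    # Every point is assigned to exactly one cluster.
    for i in range(n):
        model.addConstr(gp.quicksum(b[i, j] for j in range(k)) == 1)
    # Coupling constraints (27b), i.e., c_{(l,i)}(b, alpha^t, lambda^t, beta^t) >= 0.
    for l in range(p):
        for i in range(n):
            model.addConstr(
                gp.quicksum(
                    (float(Delta[l, i]) ** 2
                     + 2.0 * float(Delta[l, i]) * float(alpha[i, j, l])) * b[i, j]
                    for j in range(k)
                )
                <= float(lam) + float(beta[l, i])
            )
    # Objective (27a): assignment costs w.r.t. the current centers.
    dist_sq = ((X_tilde[:, :, None] - mu[:, None, :]) ** 2).sum(axis=0)  # (n, k)
    model.setObjective(
        gp.quicksum(float(dist_sq[i, j]) * b[i, j] for i in range(n) for j in range(k)),
        GRB.MINIMIZE,
    )
    model.optimize()
    return [[int(round(b[i, j].X)) for j in range(k)] for i in range(n)]
```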

6 Restart heuristic

There are specific clustering instances for which the ADMs described so far obtain visibly bad results; see, e.g., the clustering obtained for the instance Unbalance in Fig. 1. For details about the benchmark data sets, we refer to Sect. 7.2. Observe that Algorithm 2 identified two clusters, \(S_1\) and \(S_2\), where there should be only one, and identified one cluster, \(S_3\), where there should be two. This leads to the situation that the losses within cluster \(S_1\) and within cluster \(S_2\) are much smaller than the loss within cluster \(S_3\). Here and in what follows, we compute the weighted value of the loss function h in (2) restricted to cluster \(S_j\) by

$$\begin{aligned} h_{j} = \frac{1}{\left|S_j\right|} \sum _{x^i\in S_j}\Vert x^i -\mu ^j\Vert ^2_2. \end{aligned}$$
(29)
Fig. 1 Comparison between the ground truth clustering (left) and the clustering result of the ADM (right) for the instance Unbalance. Obviously, the partial minimum is of low quality

Suppose now that we join clusters \(S_1\) and \(S_2\) to form a new cluster \(S_{12}\). If the total loss within cluster \(S_{12}\) is still smaller than the one within cluster \(S_3\), then our algorithm is actually not minimizing the sum of total losses, because a better point can be obtained by splitting cluster \(S_3\) into two clusters and joining clusters \(S_1\) and \(S_2\).

Based on these observations, we propose a heuristic that prevents the ADMs in Algorithms 2–4 from getting stuck in a partial minimum of bad quality such as the one shown in Fig. 1; see Fraiman et al. (2013) for a similar heuristic.

Suppose that we have a partial minimum and the \(h_j\) values at hand. For each pair of clusters \((S_{j_1}, S_{j_2})\), we also compute their joint center and the corresponding total loss via

$$\begin{aligned} \mu ^{j_1j_2} = \frac{1}{\left|S_{j_1}\right| + \left|S_{j_2}\right|} \sum _{x^i \in S_{j_1} \cup S_{j_2}} x^i \end{aligned}$$
(30)

and

$$\begin{aligned} h_{j_{1}j_2} = \frac{1}{\left|S_{j_1}\right| + \left|S_{j_2}\right|} \sum _{x^i \in S_{j_1} \cup S_{j_2}}\Vert x^i -\mu ^{j_{1}j_2}\Vert ^2_2. \end{aligned}$$
(31)

Now, consider the set

$$\begin{aligned} \Psi \mathrel {{\mathop :}{=}}\left\{ (S_{j_1}, S_{j_2}, S_{j_3}):h_{j_{1}j_2} < h_{j_3}\right\} , \end{aligned}$$
(32)

which is the set of all possible combinations of three clusters such that the total loss within two joined clusters is smaller than the total loss within a third cluster. Note that the set \(\Psi \) can be empty. If so, this means that we cannot obtain a better partial minimum by joining two clusters and splitting another one. On the other hand, i.e., if there exists \((S_{j_1}, S_{j_2}, S_{j_3}) \in \Psi \), then the total loss of the joined clusters \(S_{j_1}\) and \(S_{j_2}\) is smaller than the total loss within cluster \(S_{j_3}\). Thus, we obtain a better partial minimum by joining \(S_{j_1}\) and \(S_{j_2}\) and by splitting cluster \(S_{j_3}\) into two smaller clusters. To this end, we update the centers in such a way that the clusters \(S_{j_1}\) and \(S_{j_2}\) are now one cluster with center \(\mu ^{j_1j_2}\), cluster \(S_{j_3}\) receives two centers which are the two furthest points in \(S_{j_3}\), and the other centers remain the same, i.e.,

$$\begin{aligned} {\hat{\mu }}^{j_1} \leftarrow \mu ^{j_1j_2}, \quad {\hat{\mu }}^{j_2} \leftarrow {\bar{x}}^i, \quad {\hat{\mu }}^{j_3} \leftarrow {\bar{x}}^{i'}, \end{aligned}$$
(33)
$$\begin{aligned} {\hat{\mu }}^j \leftarrow \mu ^j \quad \text {for all} \quad j \notin \{j_1,j_2,j_3\}, \end{aligned}$$
(34)

with

$$\begin{aligned} ({\bar{x}}^i, {\bar{x}}^{i'}) \in \mathop {{\text {arg max}}}\limits _{x^i,\, x^{i'} \in S_{j_3}} \ \Vert x^i - x^{i'}\Vert _2^2. \end{aligned}$$
(35)

Finally, if the set \(\Psi \) has more than one element, then we repeat the process starting with the element \((S_{j_1}, S_{j_2}, S_{j_3})\) that gives the minimum ratio \(h_{j_{1}j_2} / h_{j_3}\). Each time an element \((S_{j_1}, S_{j_2}, S_{j_3})\) is used, we exclude all the elements that contain \(S_{j_1}\), \(S_{j_2}\), or \(S_{j_3}\), because these clusters were already modified.

With \({\hat{\mu }}\) at hand, we compute a new partial minimum using \({\hat{\mu }}\) as the initial centers. Given the new partial minimum, we repeat the process until a termination criterion is satisfied. In Algorithm 5, we formally state the proposed heuristic as an additional part of Algorithms 2–4. The termination criterion in our implementation is to stop if the restart heuristic does not provide modified centers anymore and, as a consequence, if two consecutive partial minima are the same (up to numerical tolerances).

[Algorithm 5: Restart heuristic as an extension of Algorithms 2–4]
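The detection step of the restart heuristic, i.e., computing the per-cluster losses (29), the joint centers and losses (30)–(31), and the set \(\Psi \) from (32), can be sketched in NumPy as follows. The function name is ours; we additionally exclude triples with \(j_3 \in \{j_1, j_2\}\), since a cluster cannot be joined and split at the same time, and we return the triples already sorted by the ratio \(h_{j_1 j_2}/h_{j_3}\) used in the selection rule above. The actual center update (33)–(35) is omitted for brevity.

```python
import numpy as np
from itertools import combinations

def detect_restart_candidates(X, b):
    """Compute the triples (j1, j2, j3) of the set Psi in (32),
    sorted by the ratio h_{j1 j2} / h_{j3} (smallest first).
    Assumes that every cluster is non-empty.

    X : (p, n) data points as columns;  b : (n, k) binary assignment matrix.
    """
    n, k = b.shape
    clusters = [np.where(b[:, j] == 1)[0] for j in range(k)]

    def weighted_loss(idx):
        # Weighted loss (29)/(31) of the points X[:, idx] around their mean (30).
        pts = X[:, idx]
        center = pts.mean(axis=1, keepdims=True)
        return float(((pts - center) ** 2).sum() / len(idx))

    h = [weighted_loss(clusters[j]) for j in range(k)]
    psi = []
    for j1, j2 in combinations(range(k), 2):
        h_joint = weighted_loss(np.concatenate([clusters[j1], clusters[j2]]))
        for j3 in range(k):
            if j3 not in (j1, j2) and h_joint < h[j3]:
                psi.append((h_joint / h[j3], j1, j2, j3))
    return [(j1, j2, j3) for _, j1, j2, j3 in sorted(psi)]
```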

7 Numerical results

In this section, we present our numerical studies, which have been carried out using the setup described in Sect. 7.1. The goal here is to evaluate the performance of the proposed methods in Algorithms 2–5. To this end, we first describe in Sect. 7.2 the clustering problem instances and the evaluation metrics that we use. For analyzing the performance of the different methods, a ground truth labeling has to be defined. These labels are provided with the test data sets used in the numerical evaluation. For the optimization, these ground truth labels are not used as k-means clustering is an unsupervised learning method, i.e., the input data is not labeled. Since all algorithms assume that initial centers are given, we briefly describe how they are obtained in Sect. 7.3. In Sect. 7.4, we show that the proposed ADM performs very well for typical nominal clustering problems. Moreover, we also evaluate the effectiveness of the heuristic described in Sect. 6. In Sects. 7.5 and 7.6, we discuss the results that we obtain by robustifying these k-means clustering problems. Finally, in Sects. 7.7 and 7.8, we go further and evaluate the performance of the proposed algorithms on real-world clustering instances.

As we will see, the strictly robust method clearly outperforms the nominal and \(\Gamma \)-robust methods. In the appendix, we also present a comparison of the strictly robust method with the spectral clustering method by Von Luxburg (2007), which is another state-of-the-art method for clustering. These results further highlight the competitiveness of the strictly robust method.

7.1 Software and hardware setup

We implemented the algorithms in Python 3.8.5 and solved the binary linear as well as the convex quadratic optimization problems with Gurobi 9.1.0. We use the special Python modules sklearn.preprocessing.MinMaxScaler, sklearn.metrics.adjusted_rand_score, sklearn.metrics.silhouette_score, and scipy.stats.wilcoxon to scale the data to the range [0, 1], to compute the ARI (see below), to compute the Silhouette score (see below), and to calculate the p values for the Wilcoxon signed-rank test (see below as well), respectively. All the computations were performed on a computer with a 3.60 GHz Intel(R) Core(TM) i3-8100 processor and 16 GB RAM. For Algorithms 2–4, the termination criterion is \(\Vert \mu ^{t+1} - \mu ^{t}\Vert _\infty < 10^{-4}\).

7.2 Data sets and validation metrics

To comprehensively compare the proposed algorithms on a variety of clustering problems, we use (i) synthetically generated data for which we include some uncertainty by perturbing the data points as well as (ii) real-world data sets, which naturally contain measurement errors.

Since one of the goals of this work is to identify structural properties of clustering problems for which a robustification is beneficial, we use six synthetic clustering benchmark data sets (a, s, Unbalance, dim, g2, birch) proposed by Fränti and Sieranoja (2018). These data sets provide a perfect benchmark since they cover a wide range of typical k-means clustering problems with different degrees of overlap, density, and sparsity. Besides that, most of the data sets are two-dimensional, so we can also visualize and compare the recovered nominal and robust clustering results. For reasons of comparability, we use these data sets as a reference but exclude the data set birch with 100 clusters that leads to extremely large and challenging problems.

It is also standard in the literature to test clustering algorithms on data sets from the UCI Machine Learning Repository (Dua and Graff 2017); see, e.g., Li et al. (2016), Vo et al. (2016) and Aloise et al. (2012). Thus, to also evaluate the quality of robust solutions on high-dimensional and real-world data sets, we select 52 instances from this repository, forming a sample of clustering problems with diverse sizes and difficulties.

For the synthetic data, the ground truth partition of all data sets is publicly available in Fränti and Sieranoja (2018) and for the real-world instances, the true assignments (labels) of each data point are given as well. Thus, we choose as external validity metrics the adjusted Rand index (ARI) and the loss function value. The ARI (Hubert and Arabie 1985; Steinley 2004) measures the similarity between two assignments, ignoring permutations, within the range \([-1, 1]\). Here, 1 means perfect agreement between the two assignments.

However, in some cases, the ground truth partition corresponds to lower quality solutions, which may lead to a subjective validation if only external metrics are used as reference. Therefore, we also use the Silhouette score as an internal validation metric (Rousseeuw 1987). This score has been shown to perform well for general clustering problems in an extensive study of a variety of validation metrics by Arbelaitz et al. (2013). The Silhouette score measures the quality of the clustering in terms of cohesion and separation, taking values within the interval \([-1, 1]\), where a value close to 1 means that the data set is well clustered, whereas a result close to \(-1\) means that most of the data points have been misclassified or that there is no natural clustering structure in the data set.

Further, to assess the significance of the obtained results, we apply the Wilcoxon signed-rank test to calculate p values (Wilcoxon 1945). With this, we are able to quantify the observed improvements in cluster quality when comparing the Silhouette score as well as the ARI of nominal and robust models. If the resulting p value is smaller than the \(\alpha \)-error of 10 %, we reject the null hypothesis that the robust model is at most as good as the nominal one in terms of the Silhouette score or the ARI, respectively. Hence, a rejection supports the hypothesis that we obtain improved clustering results for erroneous data when using robustified models.
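Using the modules listed in Sect. 7.1, the three validation quantities can be computed as in the following snippet; the tiny data set and the two score vectors are dummy values for illustration only.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score
from scipy.stats import wilcoxon

# Dummy example: six two-dimensional points (rows), scaled to [0, 1].
X_scaled = np.array([[0.00, 0.10], [0.10, 0.00], [0.05, 0.05],
                     [0.90, 1.00], [1.00, 0.90], [0.95, 0.95]])
labels_true = np.array([0, 0, 0, 1, 1, 1])   # ground truth assignment
labels_pred = np.array([0, 0, 1, 1, 1, 1])   # assignment found by some algorithm

ari = adjusted_rand_score(labels_true, labels_pred)   # external metric, in [-1, 1]
sil = silhouette_score(X_scaled, labels_pred)         # internal metric, in [-1, 1]

# One-sided Wilcoxon signed-rank test on per-instance scores of two methods:
# reject the null hypothesis "robust is at most as good as nominal" if p < 0.1.
scores_nominal = np.array([0.61, 0.55, 0.72, 0.48, 0.66])
scores_robust = np.array([0.64, 0.58, 0.71, 0.53, 0.70])
_, p_value = wilcoxon(scores_robust, scores_nominal, alternative="greater")
print(ari, sil, p_value)
```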

7.3 Starting point heuristic

It is well known that a poor initialization of the centers can cause a bad clustering result. In Fränti and Sieranoja (2019), the authors study some popular initialization heuristics and test them on a benchmark library, which is the same synthetic data that we use to test our algorithms. On average, they conclude that the “furthest point heuristic”, also known as “Maxmin”, reduces the clustering error of k-means. Based on their results, we decide to compute the initial mean points \(\mu ^0\) with the “Maxmin” heuristic. The idea is to select the first center randomly within the respective bounding box, and then obtain new centers one by one. In each iteration, the next center is the point that is the furthest (max) from its nearest (min) existing center.
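A minimal NumPy sketch of the Maxmin initialization described above might look as follows; the function name, the random number generator, and drawing the first center uniformly from the bounding box are our own implementation choices.

```python
import numpy as np

def maxmin_init(X, k, rng=None):
    """Furthest-point ("Maxmin") initialization of k centers.

    X : (p, n) data points as columns.  Returns a (p, k) array of centers.
    """
    rng = np.random.default_rng() if rng is None else rng
    # First center: drawn uniformly at random from the bounding box of the data.
    lo, hi = X.min(axis=1), X.max(axis=1)
    centers = [rng.uniform(lo, hi)]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest existing center ...
        dists = np.min(
            [((X - c[:, None]) ** 2).sum(axis=0) for c in centers], axis=0
        )
        # ... and the next center is the point furthest from all existing ones.
        centers.append(X[:, int(np.argmax(dists))])
    return np.stack(centers, axis=1)
```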

7.4 Evaluation of the ADM applied to the synthetic nominal clustering problems

In this section, we discuss the performance of Algorithm 2 and its extension in Algorithm 5 applied to the synthetic data sets. To simplify the presentation, we refer to Algorithm 2 as “ADM2” and to Algorithm 5 as “ADM5”. As already mentioned at the end of Sect. 5.2, the ADM in Algorithm 2 exactly mimics the classic k-means clustering algorithm. Nevertheless, we present the numerical results for the nominal cases here as well to assess the effects of the restart heuristic and to have a proper baseline for the computational comparison of the robustified versions later on. First, we evaluate the quality of the partial minima computed by ADM2 and ADM5 for 12 instances from the synthetic data sets. The results are shown in Table 1, where the objective function value is denoted by \(h(X,\mu )\). We state both the value of h corresponding to the ground truth (“GT”) and the value of h in a partial minimum computed with ADM2 and ADM5. The Silhouette score is computed for the ground truth assignment as well as for the ADM2 and ADM5 assignments. The last two columns show the corresponding ARI values. In Table 2, we state the size of each clustering problem, i.e., the number of clusters k, the number of points n, and the number of attributes p. There, we also present the runtime (“Time”; in seconds) and the number of iterations (“It”) that ADM2 and ADM5 require to compute a partial minimum. Since the initialization heuristic for the centers is a random procedure, we apply each algorithm five times with different initializations and report the average over all five runs.

Table 1 Results of Algorithm 2 and its extended Algorithm 5 on the synthetic clustering data sets
Table 2 Information about the synthetic data sets and the performance of Algorithm 2 and its extended Algorithm 5 on these instances

Looking at the results of ADM2 first, we note that for the instances a1, a2, a3, s1, s2, dim, and g2-2-30, on average, the value of h and the Silhouette score in the partial minimum are close to the ground truth values and the ARI is also close to 1. This shows that the clusterings of the partial minima are similar to the ground truth clusterings for these instances. One can also see that the quality of the partial minima, measured both in terms of the objective function h and in terms of the Silhouette score, can be worse (e.g., for the instances s1 or s2) or better (e.g., for the instance g2-2-30) than for the ground truth.

On the other hand, for the instances s3, s4, Unbalance, g2-2-50, and g2-2-70, on average, the ARI reveals that the corresponding clusterings have low similarity, and the value of h in the partial minima is also rather different from the ground truth value. However, the Silhouette score is better and the value of h is smaller in all these cases except for the instance Unbalance, for which we obtain a partial minimum of very bad quality; see also Fig. 1. Due to this, in Sect. 6, we proposed a heuristic to improve the quality of the partial minima computed by ADM2; see Algorithm 5.

Regarding the runtime and number of iterations, the ADM2 is very fast in computing partial minima. The median average runtime is 1.9 s (with a maximum of less than 15 s) and the ADM2 never requires more than 45 iterations on average.

Now, looking at the results of ADM5, we note that the extended algorithm is able to find the ground truth clustering of the instance Unbalance. Furthermore, all results improve in terms of both the objective function h and the Silhouette score. We also observe that the results for all g2 instances (all having \(k = 2\)) remain the same because the heuristic proposed in Sect. 6 is only applicable for \(k>2\).

Let us briefly discuss the instance s4 for which we get the lowest ARI (0.608). In this case, the objective function value is significantly smaller in the ADM5 result (compared to the ground truth; 20.125 vs. 34.544), and also the Silhouette score is better (compared to the ground truth; 0.468 vs. 0.321). Both results are shown in Fig. 2, where it is easy to see that the clusterings also qualitatively differ from each other. Nevertheless, the low values of the Silhouette score reveal that this is a difficult instance for clustering.

Fig. 2 The ground truth clustering (left) and the ADM5 result (right) for the instance s4

Regarding the runtime and the number of iterations that ADM5 requires, on average, to find a partial minimum, both increase due to the restart heuristic. This is expected because, in each “outer” iteration, we may obtain new initial centers and, if this is the case, we apply ADM2 again, leading to additional inner ADM iterations. Thus, the computational cost of obtaining better partial minima is an increase in the median average runtime from 1.9 s to 5.7 s (with a maximum of 32 s for the instance a3).

We conclude that the proposed Algorithm 2 combined with the restart heuristic proposed in Algorithm 5, i.e., the combination that we denote here by ADM5, performs very well on the synthetic data sets. The algorithm is able to find partial minima of very good quality with objective function values close to or even better than the ground truth and Silhouette scores always better than or equal to the ground truth.

7.5 Strictly robust clustering results for the synthetic data

Let us start the discussion of the strictly robust clustering results with an informal note. To this end, consider the clustering instance Unbalance in Fig. 1 again. For this instance, it is obvious that the eight clusters are very well separated and not too close to each other. Only for a few points between the two lower clusters and the middle cluster in the right part of the figure may it be unclear to which cluster they belong. However, the specific assignment decision for these data points will not qualitatively change the overall clustering result. In such a setting, it cannot be expected that a slight perturbation of a certain percentage of data points by a certain amount leads to a qualitatively different result. Even if all data points are perturbed by, e.g., 1 %, the clustering will stay the same, although the clusters may be blurred a bit. This general observation is also supported by our preliminary numerical experiments: clustering of well-separated sets of data points does not benefit from robustification. Moreover, in such a situation, there is also no need for a robustified clustering compared with the nominal results.

Due to this and to the fact that most of the instances from the twelve synthetic data sets provided in Fränti and Sieranoja (2018) are practically well separated, we induce some uncertainty in the data by perturbing the points. More precisely, we consider an amount of perturbation of 10 % and different quantities of perturbed data points: 5 %, 30 %, and 50 %. Assuming that the perturbed data set is the “new nominal”, i.e., the given one, we apply the nominal and the two robust models. Each data set is perturbed 10 times, and each time the three approaches are applied to an instance, they start from the same initial centers. The results for all data sets, averaged over the 10 runs, are presented in Table 3, where we show the Silhouette score as well as the ARI of the clustering results on the perturbed data w.r.t. the nominal ADM solution on the unperturbed data (note that we do not compare to the ground truth assignment here). For each instance and each validation metric, the best average result is printed in bold. Moreover, since in some cases the differences among the methods in terms of the Silhouette score and the ARI are modest, we underline the best robust average results for which the corresponding p values (calculated for each metric and each pair of methods) are smaller than the \(\alpha \)-error of 10 % w.r.t. the nominal results. Hence, the underlined average results are statistically significantly better at the 10 % level. All calculated p values are shown in Table 10 in the appendix.
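
The following minimal sketch illustrates one way to generate such a perturbation; the helper name perturb is ours, and scaling the 10 % perturbation by the coordinate-wise range of the data is an illustrative assumption rather than a definitive description of our setup.

```python
# Hedged sketch of a perturbation routine: a given fraction of the points is
# perturbed by up to `amount` (e.g., 10 %). Scaling by the coordinate-wise
# data range is an assumption made for this illustration.
import numpy as np

def perturb(X, fraction=0.30, amount=0.10, rng=None):
    """X: float array of shape (n, p); returns a perturbed copy of X."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    X_pert = X.astype(float).copy()
    # Randomly select the points to be perturbed (e.g., 5 %, 30 %, or 50 %).
    idx = rng.choice(n, size=int(round(fraction * n)), replace=False)
    # Perturb each selected point by up to `amount` of the data range.
    scale = amount * (X.max(axis=0) - X.min(axis=0))
    X_pert[idx] += rng.uniform(-1.0, 1.0, size=(len(idx), p)) * scale
    return X_pert
```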

Table 3 Nominal and robust results on the perturbed synthetic data sets

It can be directly seen that the advantage of the strictly robust model over the nominal and \(\Gamma \)-robust models increases with the amount of perturbed data points. This is the case, e.g., for the instance Unbalance. As we induce a stronger perturbation, the clusters significantly mix with each other and the performance of the strictly robust model becomes the best in both evaluation metrics.

Let us now discuss the quality of the robustified clusterings for the instances s3, s4, a2, and a3, for which the data points are not well separated or for which the clusters are close to each other; see Fig. 3 (all other data sets are shown in Fig. 8 in the appendix). To this end, we consider Fig. 4, in which we display box plots to analyze the quality of the robustified clustering results. On the y-axes we plot the ratio of two ARI values: the ARI of the strictly robustified clustering on the perturbed data w.r.t. the nominal ADM solution on the unperturbed data, divided by the ARI of the nominal ADM clustering on the perturbed data w.r.t. the same nominal ADM solution on the unperturbed data; see the formula below. All ADMs are always applied with the restart heuristic described in Sect. 6.
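
Written out with notation that is ours and not used elsewhere in this paper, the plotted ratio for a given instance and perturbation level reads

\[ r = \frac{\operatorname{ARI}\bigl(X^{\mathrm{rob}}_{\mathrm{pert}},\, X^{\mathrm{nom}}_{\mathrm{unpert}}\bigr)}{\operatorname{ARI}\bigl(X^{\mathrm{nom}}_{\mathrm{pert}},\, X^{\mathrm{nom}}_{\mathrm{unpert}}\bigr)}, \]

where \(X^{\mathrm{rob}}_{\mathrm{pert}}\) and \(X^{\mathrm{nom}}_{\mathrm{pert}}\) denote the strictly robust and the nominal assignments computed on the perturbed data and \(X^{\mathrm{nom}}_{\mathrm{unpert}}\) denotes the nominal ADM assignment on the unperturbed data. Values \(r > 1\) thus indicate that the robustification recovers the unperturbed clustering better than the nominal method.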

Fig. 3 The four test instances for the strictly robustified clustering method

Fig. 4 Statistical performance of the strictly robust vs. the nominal k-means clustering methods measured as the ratio of the respective ARIs. Values above 1 indicate a higher ARI for the strictly robust than for the nominal k-means clustering

For all four instances, we see that the quality of the strictly robustified clustering improves as the errors (in terms of the amount of perturbed points) get larger. The detailed behavior also depends on the specific instance; thus, we discuss two exemplary ones. For a3, we see that the robustified clustering is advantageous only in the case of 50 % perturbed data points. Taking a look at Fig. 3 again, we see that the data points in instance a3 are rather well separated but the clusters are close to each other. Since all perturbed points are perturbed by 10 %, a robustification becomes beneficial once enough clusters mix with each other. This does not seem to be the case for 30 % of perturbed points but is the case for 50 %. We observed a similar behavior in our computations when the number of perturbed points stays the same and the points are perturbed by a varying amount such as 10, 20, or 30 %.

As a second example, let us consider instance s4; see also Fig. 3. Here, none of the clusters are well separated. In such a situation, already a smaller amount of perturbed points leads to a beneficial robustification, but for larger amounts of perturbed points no significant further improvement can be seen. If we take a look at Table 3 again, we see that this is also the case for the Silhouette score. The most likely reason is that, for 30 % of perturbed points, the clusters already mix significantly with each other and that this mixing does not qualitatively increase for more perturbed points.

We now discuss the computational performance of the robust clustering methods for the exemplary instances discussed above; see Table 4 (all other results are presented in Table 9 in the appendix). There, we report the runtimes and iteration counts, averaged over ten runs with different random perturbations, for the ADM with restart heuristic applied to the nominal and the two robustified clustering models. (The \(\Gamma \) columns will be discussed in the next section.) It is clearly visible that the solution of the strictly robust clustering problems takes more time: the average factor is 5.9 with a minimum of 2.0 and a maximum of 12.1. Moreover, the percentage of perturbed points does not have a clear impact on the runtime. In contrast to the runtimes, the number of required iterations does not change significantly, and there are almost as many instances with decreased average iteration counts as with increased ones. Consequently, the main additional computational work is due to the harder subproblems that need to be solved in every iteration of the ADM.

Table 4 Average runtimes and iteration counts for the corresponding ADMs with restart heuristic for the nominal, the strictly robust, and the \(\Gamma \)-robust models on the perturbed (10 %) synthetic data sets

Thus, the bottom line is as follows. The strictly robustified clustering method is harder to compute, but the additional computational effort is reasonable since the runtimes roughly stay in the same order of magnitude as those of the nominal method. Regarding the outcome of the robustified clustering, the results always have to be analyzed with the structure of the measured data points in mind, since different structures in the observed data may lead to different robustified results, as the two instances discussed above show.

Fig. 5 Nominal versus strictly robustified clustering results for the instance s3

Let us close this section on the strictly robustified clustering method with a final numerical result for the instance s3. In Fig. 5, the green triangles are the same in the left and the right plot and represent the clustering obtained by the ADM, including the restart heuristic, when applied to the unperturbed data set. Then, we perturb 50 % of all points by 10 % and again apply the ADM with restart heuristic, considering these perturbed data points as the given ones. This results in the black centers (crosses) in the left plot of Fig. 5. It is obvious that, by considering the perturbed points as the “new nominal” ones, one is not able to recover the original centers of the clusters. In contrast, the strictly robustified clustering result (right plot in Fig. 5) almost perfectly recovers the original centers.

7.6 \(\Gamma \)-robust clustering results for the synthetic data

We start with the analysis of the computational performance of the \(\Gamma \)-approach; see Table 4 again. First, one can observe that the number of iterations is again in the same order of magnitude as the number of iterations required to solve the nominal as well as the strictly robust clustering problems. Interestingly, the \(\Gamma \)-approach requires fewer iterations for 11 out of 15 instances. However, the computation times are, on average, a factor of 9.4 longer than for the nominal problems, with a minimum factor of 2.7 and a maximum factor of 12.8. This means that the computation times for the \(\Gamma \)-robustified models are roughly two times longer than the ones for the strictly robust versions.

The motivation for \(\Gamma \)-robustness in Bertsimas and Sim (2004) was to introduce a concept that leads to a less conservative notion of robustness, especially compared to strict robustness. This less conservative concept has found many applications and was also shown to be useful for classification in Bertsimas et al. (2019), which is, to the best of our knowledge, the only application of \(\Gamma \)-robustness to classification so far.

The question now is whether a \(\Gamma \)-robustification is also beneficial in the context of clustering problems. Unfortunately, an analysis comparable to the one given in Fig. 4 does not show a benefit in general; see Table 3 again. Our analysis of the results revealed two major explanations for this; the second one is discussed below in the context of Fig. 7. First, the aim of the \(\Gamma \)-approach is to hedge against worst-case realizations in \(\Gamma \) many parameters, with the specific parameters being unknown beforehand. In particular, these \(\Gamma \) many parameters do not need to correspond to those, e.g., 30 % of the parameters that are randomly chosen and perturbed in our simulations. Thus, the randomly perturbed parameters are likely not the ones against which the \(\Gamma \)-approach hedges. Let us discuss this using the example of the instance g2-2-50, which is shown in Fig. 6. Here, the two clusters are not well separated. The interesting data points for the \(\Gamma \)-approach most likely are the ones near the diagonal of the plot. Thus, only a small percentage of points can be expected to be the points for which a perturbation might lead to a beneficial behavior of the \(\Gamma \)-approach. In Fig. 7, we analyze the performance of the \(\Gamma \)-approach. It can be seen that the \(\Gamma \)-approach can lead to an improvement compared to the nominal and to the strictly robust approach, but only if the parameter \(\Gamma \) is chosen appropriately; see, e.g., the case in which \(\Gamma \) corresponds to 0.1 % of the data points. Unfortunately, it is not clear how to choose this parameter appropriately in advance.

Fig. 6 The instance g2-2-50

Fig. 7 Performance of the \(\Gamma \)-robustified vs. the nominal k-means clustering method measured as the ratio of their respective ARIs

Let us also briefly comment on the beneficial \(\Gamma \)-robustifications reported for SVMs in Bertsimas et al. (2019). There, the authors apply this approach to the case in which the labels are subject to uncertainty, which is not the situation here since no labels are considered. In the case that Bertsimas et al. (2019) call “feature-robust”, which corresponds to our setting, they also use strictly robustified models.

Figure 7 also makes another issue visible, namely that the \(\Gamma \)-robustified results do not tend to the strictly robust solutions as \(\Gamma \) grows. In theory, however, this should be the case. Our experiments revealed that this is due to non-optimal partial minima computed by the ADM in Algorithm 4, even though we present the results obtained with the restart heuristic. This problem is not visible for the strictly robustified problems, which we also solve using a tailored ADM, and might simply be explained by the overall increased complexity of the \(\Gamma \)-robustified models compared to the nominal or strictly robust models.

To sum up, we found an interesting case in which less conservatism does not lead to better results, contrary to what is often expected in robust optimization. Moreover, the advantageous clustering results of the strictly robust models are even more pronounced because these models are also significantly easier to solve; see Table 4 again. Thus, if no specific value of \(\Gamma \) is available, we strongly suggest using the strictly robustified clustering method for data sets with unstructured errors. Therefore, in Sect. 7.8, we only consider the strictly robustified model for the real-world data sets.

7.7 Evaluation of the ADM applied to nominal real-world clustering problems

In this section, we evaluate the performance of Algorithm 2 with the restart heuristic applied to a sample of 52 real-world data sets taken from Dua and Graff (2017). Using the starting point procedure described in Sect. 7.3, the ADM is applied five times with different initial centers. The runtimes and iteration counts, averaged over the five runs, are presented in Table 5, where we also give information on the size of the instances. Again, as for the synthetic data, the ADM is very fast in computing partial minima. The median average runtime is 0.1 s and the maximum number of required iterations is, on average, less than 57. Interestingly, the ADM is significantly faster for these real-world instances than for the synthetic data sets. Thus, the proposed method seems to scale very well.

Table 5 Computational performance of Algorithm 2 using the restart heuristic on the real-world clustering data sets

Let us now discuss the quality of the partial minima. To this end, we again consider the ARI as an external validation metric, since the true labels are available, and we also report the Silhouette score to evaluate the clustering in terms of separation and cohesion. The average results over the five runs are shown in Table 6, where we additionally compute the Silhouette score for the true labels. First, we can see that for some instances the clustering result is very bad. Negative values of the ARI mean that the ADM assignment result has little similarity with the true assignment. However, a closer look at the Silhouette score of the true labels reveals that the negative (or very small) ARI values occur exactly when this score is also very small. The most likely explanation for this is that the data set simply does not have an underlying clustering structure and, consequently, does not fit the k-means clustering approach well.

Table 6 Results of Algorithm 2 using the restart heuristic on the real-world clustering data sets

In contrast, when the Silhouette score of the true labels is larger than, e.g., 0.2, the ARI is in general also closer to 1. Thus, the ADM assignment result is more similar to the true labeling. In other words, if the real-world data set possesses an underlying clustering structure, then the proposed ADM combined with the restart heuristic performs very well and is able to find partial minima of good quality.

7.8 Strictly robust clustering results on real-world data sets

As seen in the last section, several of the 52 real-world data sets turned out to be instances that are not suitable for the k-means clustering method because they do not possess an underlying clustering structure. Moreover, our experiments reveal that applying the strictly robustified k-means model to these instances does not significantly improve an already bad nominal clustering result. Therefore, from now on, we focus only on those instances for which the Silhouette score of the true labels is larger than 0.2, and, due to our previous results on the synthetic data sets, we only consider the strictly robustified model here.

In order to compare the nominal method against its strictly robust counterpart, we apply each method five times with different seeds for the initialization procedure; in each run, both methods start from the same initial centers. The average results over the five runs are presented in Table 7, where the best ARI result for each data set is highlighted in bold. As for the synthetic data sets, we calculated p values to assess the significance of the obtained results. Hence, we underline the average results for which the difference in the ARI between the nominal and the strictly robust method is statistically significant (p value smaller than the \(\alpha \)-error of 10 %). All calculated p values are shown in Table 11 in the appendix.

Table 7 Performance of the strictly robust versus the nominal k-means clustering method on real-world clustering data sets

Note that there is no big difference between the results in terms of the Silhouette score. The ARI, on the other hand, is more telling here since we compare how similar the two assignment results are to the true labeling. We can see that only for the instance mammographic-mass does the nominal method outperform the strictly robust method. For 10 out of the 15 instances, the strictly robust model performs best, yielding clustering results that have more similarity to the true classes than the nominal clustering results do. Moreover, the underlined average results reveal that the improvement in clustering quality attained with the strictly robust method is statistically significant. Thus, in the presence of potential measurement errors, the strictly robustified k-means clustering method is able to give more accurate results than its nominal counterpart.

The runtimes and iteration counts are listed in Table 8. Clearly, on average, the strictly robust method needs more time to find a partial minimum. However, it is still rather fast, never requiring more than 39 s on average. In contrast to the runtimes, the number of iterations does not change significantly.

Table 8 Computational performance of the nominal and strictly robust models on real-world clustering data sets

We conclude that the strictly robustified k-means clustering model is also beneficial for real-world clustering instances, which likely contain uncertainties due to measurement errors. Specifically, for almost all of the instances, the strictly robust model leads to improvements regarding the recovery of the true assignment over the nominal model.

Fig. 8 Some of the synthetic clustering instances used for the experiments; originally proposed in Fränti and Sieranoja (2018)

Table 9 Average runtimes and iteration counts for the corresponding ADMs with restart heuristic for the nominal, the strictly robust, and the \(\Gamma \)-robust models on the perturbed (10 %) synthetic data sets
Table 10 Calculated p values using the Wilcoxon signed-rank test for the synthetic data sets w.r.t. each validation metric and each pair of methods
Table 11 Calculated p values using the Wilcoxon signed-rank test for the real-world data sets w.r.t. each validation metric and each pair of methods
Table 12 Performance of the nominal, strictly robust, and spectral clustering methods on the perturbed synthetic data sets
Table 13 Performance of the nominal, strictly robust, and spectral clustering methods on the real-world clustering data sets
Table 14 Calculated p values using the Wilcoxon signed-rank test for the synthetic data sets
Table 15 Calculated p values using the Wilcoxon signed-rank test for the real-world data sets

8 Conclusion

In this paper, we derived the strictly and \(\Gamma \)-robust counterparts of the k-means clustering problem to hedge clustering results against errors without known structure in the observed data. The robustification makes it possible to obtain clusterings that are closer to clusterings of the original, i.e., error-free, data although only erroneous input data can be considered. Since already the nominal problem is NP-hard, we developed tailored alternating direction methods to quickly compute (robust) feasible solutions of good quality. Since these methods only converge to so-called partial minima, which may correspond to clusterings of bad quality, we developed a restart heuristic and experimentally showed that it significantly improves the quality of the computed clusterings. Our comparison of the strictly and \(\Gamma \)-robust clustering methods reveals the interesting aspect that the less conservative \(\Gamma \)-approach is outperformed by the classic concept of strict robustness. Most importantly, the strictly robustified clustering method is able to recover clusterings of the original data even if only erroneous measurements are observed. Thus, we see that the studied robustifications can lead to more reliable models.

Let us finally sketch some open problems that we do not resolve in this paper. One issue with the \(\Gamma \)-approach is that partial minima are obtained that do not correspond to globally optimal clusterings. Consequently, the development of methods to compute global optima of robustified clustering problems may be a worthwhile topic for future work. Moreover, we do not answer the question of how to derive practically meaningful uncertainty sets for the robustification; in practical applications, this may be done by means of cross-validation. Further, other uncertainty sets such as polyhedral and ellipsoidal sets should be considered in future work as well.

Finally, one could study whether other clustering approaches such as spectral or hierarchical clustering can also be robustified so that they can cope with uncertain data. However, these approaches are not based on a clear-cut optimization model, so new robustness concepts would need to be developed that are possibly not directly based on robust optimization.