1 Introduction

Cluster analysis or clustering is the task of partitioning objects into groups so that objects in the same group are similar to each other according to some defined distance measure [1]. Clustering is a fundamental technique for data analysis that belongs to the subclass of unsupervised learning algorithms in machine learning and statistics. Its relevance is attested by a large number of algorithms and applications such as customer segmentation, medical imaging, recommendation systems, social network analysis, and image processing [2]. The complexity of finding a suitable clustering depends significantly on the fitness measure used to evaluate a proposed partition. Among the many criteria used in cluster analysis, the most studied and frequently adopted is the minimum sum-of-squares clustering (MSSC) criterion. Given a set of n data points \(\{p_i\}_{i=1}^n\) in d-dimensional Euclidean space, the MSSC problem aims to partition them into k clusters \(\{C_j\}_{j=1}^k\) so that the total sum of squared Euclidean distances between the data points and the cluster centers \(\{m_j\}_{j=1}^k\) is minimized. The MSSC problem is known to be NP-hard [3], since it leads to a non-convex mixed-integer nonlinear optimization problem (MINLP) that is challenging to solve in practice [4]. This is why the MSSC is frequently addressed by heuristic algorithms rather than by exact methods.
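For concreteness, the following minimal sketch (Python/NumPy; the function and variable names are ours, not from the original) evaluates the MSSC objective of a given assignment by computing each center as the cluster mean:

```python
import numpy as np

def mssc_objective(P, labels, k):
    """Total squared Euclidean distance between each point and the
    centroid of its cluster (the MSSC objective)."""
    total = 0.0
    for j in range(k):
        Cj = P[labels == j]          # points currently assigned to cluster j
        m_j = Cj.mean(axis=0)        # cluster center = centroid
        total += ((Cj - m_j) ** 2).sum()
    return total

# tiny example: 6 points in the plane, k = 2
P = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
print(mssc_objective(P, np.array([0, 0, 0, 1, 1, 1]), k=2))
```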

In recent years, researchers have started to focus on clustering with user constraints, i.e., instance-level and cluster-level constraints, to make the clustering process more accurate [5]. Instance-level constraints, typically must-link and cannot-link constraints, indicate that two points must or cannot be assigned to the same cluster [6, 7]. Cluster-level constraints, instead, exploit prior knowledge on the structure of the clusters by imposing fixed cluster sizes or lower and upper bounds on the capacity of each cluster [8,9,10]. Cardinality constraints find relevant applications in demand planning and segmentation [11], electric power systems research [12], and document clustering [13]. In both unconstrained and constrained clustering, the presence of outliers can deteriorate the quality of the solution and should be carefully considered. Strict cardinality constraints offer a framework to take the presence of outliers into account, improving the quality of the clustering solution [4]. This is achieved in [4] by allocating an extra cluster to accommodate outliers. For a detailed overview of constrained clustering applications, see the survey in [14] and references therein. In this paper, we focus on the MSSC problem with strict cardinality constraints, in short ccMSSC. Adding constraints to an existing clustering formulation may make the resulting problem harder, both empirically and in terms of worst-case analysis [15]. However, solving clustering problems only by means of heuristic algorithms may yield unsatisfactory results [16]. For this reason, global algorithms play a fundamental role in two respects. First, a certified globally optimal solution is fundamental for evaluating, improving, and developing heuristics and approximation algorithms. Second, since clustering is an unsupervised machine learning task, results usually require interpretation from domain experts; because the majority of algorithms can only locate local minimizers, this interpretation may be completely erroneous if the obtained clustering solution is far from the global optimum.

Our motivation for studying ccMSSC is twofold. First, constrained clustering with instance-level constraints is a well-covered topic in the literature, with both heuristic and global optimization algorithms. Clustering with cardinality constraints, instead, is less studied, and the existing methods are only capable of solving instances with a very limited number of data points, i.e., less than \(n=150\). Second, clustering with cardinality constraints is robust against unbalanced solutions in which some clusters contain very few or even no data points.

In this paper, we propose a global optimization algorithm for constrained MSSC where the prior knowledge is incorporated in the form of strict cardinality constraints. This method is based on the branch-and-cut technique and exploits semidefinite programming (SDP) tools to obtain tight lower and upper bounds. The main original contributions of this paper are:

  1. We derive a new SDP relaxation for ccMSSC and compare it theoretically and numerically with the SDP relaxation recently proposed in [4]. The new bound can be weaker than the one in [4], but it is solved significantly faster as n and k increase.

  2. We strengthen the new SDP bound and the one in [4] by adding polyhedral cuts, and we attack the resulting SDP relaxations by means of a cutting-plane algorithm.

  3. We design a deterministic rounding procedure providing an upper bound on the optimal value of ccMSSC. We use a variant of the k-means algorithm in conjunction with the solution of the SDP relaxation to initialize the cluster centers.

  4. We propose an SDP-based branch-and-cut algorithm producing certifiably optimal solutions for ccMSSC. We use both SDP relaxations for computing the lower bound, exploiting the strength of the one in [4] for small instances and the efficiency of the new one for larger instances.

  5. Benefiting from a tailored branching strategy, we reduce the size of the problems in the children nodes by solving equivalent SDPs over lower-dimensional positive semidefinite cones.

  6. Our exact algorithm solves to global optimality, for the first time, real-world instances of size 10 times larger than those solved by exact algorithms proposed in the literature.

The rest of this paper is organized as follows. Section 2 reviews the literature related to the ccMSSC problem. In Sect. 3, the existing SDP relaxation of ccMSSC and the new SDP bound are described. In Sect. 4, problem-specific post-processing techniques yielding valid lower bounds are described. In Sect. 5, the cutting-plane algorithm used for computing the bound is proposed. A problem-specific branching rule is shown in Sect. 6, and a primal heuristic is introduced in Sect. 7. Computational results are reported in Sect. 8 and some directions for future work conclude the paper in Sect. 9.

2 Related work

To the best of our knowledge, the first heuristic algorithm for ccMSSC can be found in [17]. The authors use the classical Lloyd’s algorithm [18] in conjunction with the solution of a linear assignment problem requiring that each cluster contains at least a minimum number of points. Since then, not many heuristics have been proposed in the literature. In [8, 19, 20], different heuristic algorithms are proposed to solve the balanced version of ccMSSC where the clusters have the same number of data points, i.e., \(|C_1| = |C_2| = \cdots = |C_k| = n/k\). Although these heuristics tend to produce feasible solutions in a reasonable amount of time, their quality highly depends on the choice of the initial solution and, more importantly, they do not provide optimality guarantees. In the literature, several global optimization algorithms have been proposed for both unconstrained and constrained MSSC problems. In this direction, there are two main methodological approaches: (a) mathematical programming techniques based on branch-and-bound (B&B) or column generation and (b) constraint programming (CP) techniques. The first class of methods is predominant in the unconstrained literature [21,22,23,24], with some extensions designed to solve the MSSC with instance-level constraints [25,26,27]. However, most of the recent contributions to constrained variants come from the CP community. Among the exact methods for constrained MSSC, the method presented in [28] is the first attempt at using CP for MSSC. Other successful approaches based on the CP paradigm can be found in [29, 30]. Turning now to the MSSC with cardinality constraints, all the CP approaches proposed for the MSSC with instance-level constraints can be extended to solve the ccMSSC by simply adding a global cardinality constraint over a set of variables. However, as described in [31], a tailored approach consisting of specialized global constraints coupled with filtering algorithms and search heuristics can quickly reduce the search space and achieve better performance. Despite the improvement with respect to previous CP approaches, the algorithm in [31] exhibits high computational times on small instances, and for some of them it does not succeed in certifying the optimality of the produced clustering. To the best of our knowledge, this CP approach is the current state-of-the-art global optimization algorithm for ccMSSC.

In recent years, several conic optimization approaches have been proposed in the literature. In particular, due to their effectiveness and strong theoretical properties, there is a large body of literature on the application of techniques from semidefinite programming (SDP) [32]. Peng and Wei [33] prove the equivalence between the MSSC problem and a nonlinear SDP reformulation, the so-called 0–1 SDP, and they provide an SDP relaxation. Since then, the Peng-Wei model has been studied from the theoretical point of view, developing recovery guarantees and conditions under which the relaxation recovers the underlying clusters with high probability [34,35,36,37]. Recently, it has been successfully embedded in B&B algorithms to globally solve unconstrained MSSC [24] and its variant with pairwise constraints [27]. To the best of our knowledge, the only conic optimization scheme for ccMSSC has been proposed in [4]. Although the focus there is not on exact methods, the authors propose lower bounds based on semidefinite and linear programming relaxations and a rounding heuristic to find a feasible clustering satisfying the constraints.

The LP and SDP relaxations in [4] are proven to recover the globally optimal solution whenever a cluster separation condition is met, but only in the balanced case. The separation condition requires all cluster diameters to be smaller than the distance between any two distinct clusters. In other words, for datasets whose hidden classes are balanced and well separated, they succeed in recovering the provably optimal clustering. However, when this condition is not met, they are unable to prove optimality since the upper bound produced by the rounding procedure does not coincide with the lower bound obtained via the relaxations. Furthermore, in real-world applications, data may not be generated according to a distribution that yields well-separated clusters, and this may result in unbalanced and overlapping clusters, implying that the theory behind cluster recovery does not apply. For this reason, we certify the globally optimal solution by using global optimization tools that do not require assumptions on the data distribution or on the clusters’ structure. Building on the work in [4], we design the first SDP-based branch-and-cut algorithm for solving ccMSSC.

2.1 Notation

Throughout this paper, [n] denotes the set \(\{1, \ldots , n\}\) of indices of the data points, and [k] denotes the set \(\{1, \ldots , k\}\) of indices of the clusters. Let \({\mathcal {S}}^n\) be the set of all \(n\times n\) real symmetric matrices. We denote by \(M\succeq 0\) a positive semidefinite matrix M and by \({\mathcal {S}}_+^n\) the set of all positive semidefinite matrices of size \(n\times n\). Analogously, we denote by \(M \succ 0\) a positive definite matrix M and by \({{\mathcal {S}}}_{++}^n\) the set of all positive definite matrices of size \(n\times n\). We denote by \(\langle \cdot ,\cdot \rangle \) the trace inner product. That is, for any \(A, B \in {\mathbb {R}}^{n\times n}\), we define \(\langle A,B\rangle := \text {tr}(B^\top A)\). Given a matrix A, we denote by \(A_{i,:}\) and \(A_{:,j}\) the i-th row and the j-th column of A, respectively. Finally, we denote by \(1_n\) the vector of all ones of size n and by \(I_n\) the identity matrix of size \(n \times n\).

3 SDP relaxations for ccMSSC

We study the ccMSSC problem, which is defined as the task of partitioning n data points \(p_1, \ldots , p_n\), where \(p_i \in {\mathbb {R}}^d\), into k clusters of known sizes \(c_1, \ldots , c_k\), with \(\sum _{j=1}^k c_j = n\) and \(c_j \in {\mathbb {Z}}_+\), such that the total sum-of-squared intra-cluster distances is minimized. It can be formulated as the following discrete optimization problem:

$$\begin{aligned}&\!\min \quad \frac{1}{2} \sum _{j=1}^{k} \frac{1}{c_j} \sum _{s=1}^{n} \sum _{t=1}^{n} d_{st} (\pi _j)_s (\pi _j)_t \end{aligned}$$
(ccMSSC)
$$\begin{aligned}&\text {s.t.} \quad \sum _{j=1}^{k} (\pi _j)_i = 1 \ \quad \forall i \in [n], \end{aligned}$$
(1a)
$$\begin{aligned}&\qquad \sum _{i=1}^{n} (\pi _j)_i = c_j \quad \forall j \in [k], \end{aligned}$$
(1b)
$$\begin{aligned}&\qquad \pi _j \in \{0, 1\}^n \ \ \quad \forall j \in [k]. \end{aligned}$$
(1c)

Here, \(d_{st}\) is the squared Euclidean distance between \(p_s\) and \(p_t\), computed as \(\Vert p_s - p_t\Vert _2^2\), and \(\pi _j = [(\pi _j)_1, \ldots , (\pi _j)_n]^\top \) is the indicator variable of cluster j, i.e., the i-th component of \(\pi _j\) is set to 1 if the i-th data point is assigned to cluster j and to 0 otherwise. Constraints (1a) ensure that each data point is assigned to exactly one cluster and constraints (1b) ensure that cluster j contains exactly \(c_j\) data points. For general n and k, the MSSC problem with cardinality constraints is known to be NP-hard [4]. In this paper we are interested in exact solutions, which we generate by means of a branch-and-cut algorithm. Therefore, we need efficiently computable lower and upper bounds on the original problem. For the ccMSSC problem, lower bounds obtained by solving the LP relaxation are weak in practice [4]. By contrast, the bounds provided by SDP relaxations, strengthened through the valid inequalities we introduce next, are much stronger. This motivates us to study SDPs that can be efficiently used within a branch-and-cut framework.
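The pairwise form of the objective used above is equivalent to the more familiar centroid form: for a cluster with c points and mean m, \(\frac{1}{2c}\sum _{s,t} d_{st} = \sum _i \Vert p_i - m\Vert _2^2\). A quick numerical check of this identity (a sketch with synthetic data; not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(8, 3))      # one cluster of c = 8 points in R^3
c = P.shape[0]
D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)   # d_st matrix

pairwise_form = D.sum() / (2 * c)                  # (1/2c) sum_{s,t} d_st
centroid_form = ((P - P.mean(axis=0)) ** 2).sum()  # sum_i ||p_i - m||^2
assert np.isclose(pairwise_form, centroid_form)
```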

3.1 Vector lifting SDP relaxation

In this section we review the relaxation technique presented in [4] for problem (ccMSSC). We present the formulation introduced in [4], where we shift the decision variables from \(\{-1,1\}\) to \(\{0,1\}\). This change of variables allows us to better relate the relaxation in [4] to the new one introduced in this paper. Let \(\Pi _j\) be the \(n\times n\) symmetric matrix given by \(\Pi _j = \pi _j (\pi _j)^\top \) for \(\pi _j \in \{0, 1\}^n\) and \(j \in [k]\). This implies \(\Pi _j \succeq \pi _j (\pi _j)^\top \), \(\text {diag}(\Pi _j) = \pi _j\) and \(\text {rank}(\Pi _j) = 1\). Conversely, for any matrix \(\Pi _j\) satisfying \(\Pi _j \succeq 0\), \(\text {diag}(\Pi _j) = \pi _j\) and \(\text {rank}(\Pi _j) = 1\) we have \(\Pi _j = \pi _j (\pi _j)^\top \) for some \(\pi _j \in \{0, 1\}^n\) [32]. Denote by D the Euclidean distance matrix, i.e., \(D_{ij} = \Vert p_i - p_j\Vert _2^2\). Problem (ccMSSC) can be exactly rewritten as

$$\begin{aligned}&\min \quad \frac{1}{2} \left\langle D, \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \right\rangle \nonumber \\&\text {s.t.} \quad \sum _{j=1}^{k} \pi _j = 1_n, \nonumber \\&\qquad 1_n^\top \pi _j = c_j \quad \forall j \in [k],\\&\qquad \Pi _j \succeq \pi _j (\pi _j)^\top \quad \forall j \in [k],\nonumber \\&\qquad \text {rank}(\Pi _j) = 1 \quad \forall j \in [k],\nonumber \\&\qquad \text {diag}(\Pi _j) = \pi _j \quad \forall j \in [k].\nonumber \end{aligned}$$
(2)

Because of the rank-one constraints, this is not a semidefinite program. Constraint \(\Pi _j - \pi _j (\pi _j)^\top \succeq 0\) can be reformulated, via Schur's complement, as the lifted constraints

$$\begin{aligned} {Y_j=} \begin{bmatrix} 1 &{}\quad (\pi _j)^\top \\ \pi _j &{}\quad \Pi _j \end{bmatrix} \in {\mathcal {S}}_{+}^{n+1} \qquad \forall j \in [k]. \end{aligned}$$

Furthermore, looking at the structure of \(\Pi _j\), additional constraints can be added to strengthen the formulation. It easily follows that \(\Pi _j 1_n = c_j \pi _j\) and \(\Pi _j \ge 0_{n \times n}\) for all \(j \in [k]\).

Let \(W \in {\mathcal {S}}^n_+\) be the matrix of the inner products of the data points, i.e., \(W_{ij} = p_i^\top p_j\). We now rewrite the objective function using the Lindenstrauss mapping between the Euclidean distance matrix D and the Gram matrix W on the clustering feasible set [38, 39]:

$$\begin{aligned} D = \text {diag}(W)1_n^\top + 1_n \text {diag}(W)^\top - 2 W. \end{aligned}$$

The objective function of problem (2) can be rewritten as

$$\begin{aligned} \frac{1}{2} \left\langle D, \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \right\rangle&= \frac{1}{2} \left\langle \text {diag}(W)1_n^\top , \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \right\rangle + \frac{1}{2} \left\langle 1_n \text {diag}(W)^\top , \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \right\rangle \\&\quad - \left\langle W, \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \right\rangle \\&= \frac{1}{2} \left\langle \text {diag}(W),\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j 1_n\right\rangle + \frac{1}{2} \left\langle \text {diag}(W),\sum _{j=1}^{k} \frac{1}{c_j} (\Pi _j)^\top 1_n\right\rangle \\&\quad - \left\langle W, \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \right\rangle \\&= \frac{1}{2} \text {diag}(W)^\top 1_n + \frac{1}{2} 1_n^\top \text {diag}(W) - \left\langle W,\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j\right\rangle = \text {tr}(W) \\&- \left\langle W,\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j\right\rangle , \end{aligned}$$

where the last equality derives from the constraints linking \(\Pi _j\) and \(\pi _j\), i.e., \(\Pi _j 1_n = c_j \pi _j\) and \(\sum _{j=1}^{k} \pi _j = 1_n\). Therefore, the resulting SDP relaxation is the convex program

$$\begin{aligned} f^\star _{\text {VL}} = \min _{} \quad&{\left\langle W,I_n-\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j\right\rangle }\\ \text {s.t.} \quad&\sum _{j=1}^{k} \pi _j= 1_n\\&(\pi _j, \Pi _j) \in {\mathcal {S}}_{\text {VL}}(c_j) \quad \forall j \in [k] \\ \end{aligned}$$
(VL-SDP)

where, for any \(c \in {\mathbb {Z}}_+\), the set \({\mathcal {S}}_{\text {VL}}(c) \subset {\mathbb {R}}^n \times {\mathcal {S}}^n\) is defined as

$$\begin{aligned} \mathcal {S_{\text {VL}}}(c) = \left\{ (\pi , \Pi ) \in {\mathbb {R}}^n \times {\mathcal {S}}^n : \begin{array}{l} 1_n^\top \pi = c, \ \text {diag}(\Pi ) = \pi , \ \Pi 1_n = c \pi , \\ \Pi \succeq \pi (\pi )^\top , \ \Pi \ge 0_{n \times n} \\ \end{array} \right\} . \end{aligned}$$

From now on, we refer to the relaxed problem as the “vector lifting” SDP relaxation. This kind of SDP is a doubly nonnegative program (DNN), since the matrix variable is both positive semidefinite and elementwise nonnegative. Problem (VL-SDP) provides a lower bound on the optimal objective value of problem (ccMSSC). Moreover, if the optimal solution \(\{(\pi _j^\star , \Pi _j^\star )\}_{j=1}^k\) of problem (VL-SDP) satisfies \(\Pi _j^\star = \pi _j^\star (\pi _j^\star )^\top \) for all \(j \in [k]\), then we can conclude that \(\{(\pi _j^\star , \Pi _j^\star )\}_{j=1}^k\) is an optimal solution of problem (ccMSSC). Note that this SDP relaxation has k matrix variables of size \((n+1) \times (n+1)\), and thus requires \(O(kn^2)\) variables and constraints. Although the number of clusters k is much smaller than n, the relaxation becomes impractical for branch-and-bound algorithms as n and k grow. In the following, we substantially reduce the number of variables by introducing a new SDP bound. This relaxation is based on the “matrix lifting” technique [40, 41] and has only one matrix variable of size \((n+k) \times (n+k)\). We will show that this reduction in the number of variables leads to significant computational savings compared to the vector lifting SDP relaxation. Note that the matrix lifting technique has also been successfully employed in graph partitioning [42, 43], where the nodes of a graph have to be partitioned into clusters so as to minimize the weight of the edges among different clusters.
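To make the formulation concrete, here is a compact prototype of (VL-SDP) using cvxpy (an assumption on tooling; the paper does not prescribe a modeling language, and a production implementation would use a first-order solver as discussed in Sect. 4):

```python
import cvxpy as cp
import numpy as np

def vl_sdp_bound(W, cards):
    """Sketch of the vector lifting relaxation (VL-SDP).
    W: n x n Gram matrix; cards: list of the k cluster sizes c_j."""
    n = W.shape[0]
    Ys = [cp.Variable((n + 1, n + 1), symmetric=True) for _ in cards]
    cons = []
    for Yj, cj in zip(Ys, cards):
        pj, Pj = Yj[0, 1:], Yj[1:, 1:]        # pi_j and Pi_j blocks
        cons += [Yj >> 0, Yj[0, 0] == 1,      # lifted PSD constraint
                 cp.sum(pj) == cj,            # 1^T pi_j = c_j
                 cp.diag(Pj) == pj,           # diag(Pi_j) = pi_j
                 Pj @ np.ones(n) == cj * pj,  # Pi_j 1_n = c_j pi_j
                 Pj >= 0]                     # elementwise nonnegativity
    cons.append(sum(Yj[0, 1:] for Yj in Ys) == 1)   # sum_j pi_j = 1_n
    obj = cp.trace(W) - sum(cp.trace(W @ Yj[1:, 1:]) / cj
                            for Yj, cj in zip(Ys, cards))
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve()                              # any installed SDP solver
    return prob.value
```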

3.2 Matrix lifting SDP relaxation

We now describe the new bound obtained from our matrix lifting SDP relaxation. This relaxation uses only \(O\left( (n+k)^2\right) \) variables and constraints. Since the computational complexity of solving SDPs is a polynomial function of the number of variables, we can expect a significant complexity reduction with respect to the vector lifting SDP relaxation. In the following, we briefly review the well-known discrete optimization model for unconstrained MSSC and the SDP relaxation proposed in [33]. Let X be the \(n \times k\) assignment matrix, i.e., \(X_{ij} = 1\) if the i-th data point is assigned to the j-th cluster and 0 otherwise. The Peng-Wei MSSC discrete model in matrix notation is

$$\begin{aligned} \begin{aligned} \min _{} \quad&\text {tr}(W - W X (X^\top X)^{-1} X^\top )\\ \text {s.t.} \quad&X 1_k = 1_n, \ X^\top 1_n \ge 1_k, \ X \in \{0, 1\}^{n \times k}. \end{aligned} \end{aligned}$$
(3)

By setting \(Z = X(X^\top X)^{-1}X^\top \), the following SDP relaxation can be derived:

$$\begin{aligned} \begin{aligned} f^\star _\text {PW}=\min _{} \quad&\text {tr}(W - W Z)\\ \text {s.t.} \quad&Z 1_n=1_n, \ \text {tr}(Z)=k, \ Z \in {\mathcal {S}}^n_+, \ Z \ge 0_{n \times n}. \end{aligned} \end{aligned}$$
(4)

Note that constraints \(Z \ge 0_{n \times n}\) and \(Z 1_n = 1_n\) ensure that Z is a stochastic matrix, and hence all of its eigenvalues lie between 0 and 1. We now extend the above setting to incorporate cardinality constraints. Let \(C = \text {Diag}(c_1, \ldots , c_k)\) be the diagonal matrix containing the cluster sizes. An exact reformulation of ccMSSC in matrix notation is

$$\begin{aligned} \begin{aligned} \min _{} \quad&\text {tr}(W - W X C^{-1} X^\top )\\ \text {s.t.} \quad&X 1_k= 1_n, \ X^\top 1_n = \text {diag}(C), \ X \in \{0, 1\}^{n \times k}\\ \end{aligned} \end{aligned}$$
(5)

where X is the assignment matrix. Clearly, problem (ccMSSC) is equivalent to problem (5). To see this, note that the j-th column of X is the characteristic vector of cluster j, i.e., \(\pi _j = X_{:,j}\) for all \(j \in [k]\), and the feasible k-partitions are in one-to-one correspondence with the set of partition matrices

$$\begin{aligned} {\mathcal {F}} = \left\{ X \in \{0, 1\}^{n \times k} : \ X 1_k = 1_n, \ X^\top 1_n = \text {diag}(C) \right\} . \end{aligned}$$
(6)

To obtain the matrix lifting SDP relaxation of problem (5), we first linearize the objective function by introducing a new matrix variable \(Z = X C^{-1} X^\top \). This yields the feasible set

$$\begin{aligned} \mathcal {{\bar{F}}} = \text {conv}\left\{ Z \in {\mathcal {S}}^{n}: \ Z = XC^{-1}X^\top , \ X \in {\mathcal {F}} \right\} . \end{aligned}$$
(7)

Therefore, we can rewrite the problem as

$$\begin{aligned} \begin{aligned} \min _{Z \in \mathcal {{\bar{F}}}} \quad&\text {tr}(W - W Z)\\ \end{aligned} \end{aligned}$$
(8)

In order to approximate the set \(\mathcal {{\bar{F}}}\), we relax the integrality constraint on X and set \(X \ge 0_{n \times k}\). Next, we replace the non-convex constraint \(Z = X C^{-1} X^\top \) by \(Z \succeq X C^{-1} X^\top \), which by Schur's complement is equivalent to the linear matrix inequality

$$\begin{aligned} {Y} = \begin{bmatrix} C &{} \quad X^\top \\ X &{}\quad Z \end{bmatrix} \in {\mathcal {S}}^{n+k}_+. \end{aligned}$$
(9)

Although constraint \(Z = X C^{-1} X^\top \) is relaxed, we can still consider some additional linear constraints to further improve the quality of the solution. Since Z approximates \(X C^{-1} X^\top \) and \(X \ge 0_{n \times k}\), we have \(Z \ge 0_{n \times n}\), \(\text {diag}(Z) = X \text {diag}(C^{-1})\) and

$$\begin{aligned} Z 1_n&= X C^{-1} X^\top 1_n = X C^{-1} \text {diag}(C) = X 1_k = 1_n. \end{aligned}$$
(10)

By collecting all the mentioned constraints, the matrix lifting SDP relaxation is given by

$$\begin{aligned} f^\star _\text {ML} = \min _{(X, Z) \in {\mathcal {S}}_{\text {ML}}} \quad&\text {tr}(W - W Z) \end{aligned}$$
(ML-SDP)

where the convex set \({\mathcal {S}}_{\text {ML}} \subset {\mathbb {R}}^{n \times k} \times {\mathcal {S}}^n\) is defined as

$$\begin{aligned} \mathcal {S_{\text {ML}}} = \left\{ (X, Z) \in {\mathbb {R}}^{n \times k} \times {\mathcal {S}}^n : \begin{array}{l} X 1_k = 1_n, \ X^\top 1_n = \text {diag}(C), \\ Z 1_n = 1_n, \ \text {diag}(Z) = X \text {diag}(C^{-1}), \\ X \ge 0_{n \times k}, \ Z \ge 0_{n \times n}, \ Z \succeq X C^{-1} X^\top \\ \end{array} \right\} . \end{aligned}$$
(11)

Similarly to (VL-SDP), problem (ML-SDP) is a DNN program whose optimal value yields a lower bound on the optimal value of problem (ccMSSC). Moreover, if the optimal solution \((X^\star , Z^\star )\) is an extreme point of \(\mathcal {{\bar{F}}}\), then we can conclude that \(X^\star _{:,j} = \pi ^\star _j\) for all \(j \in [k]\) is an optimal solution of problem (ccMSSC).

It is interesting to compare problem (ML-SDP) with the Peng-Wei SDP relaxation of unconstrained MSSC in (4). The matrix Z in problem (ML-SDP) shares the same structure as the one in (4). This implies that valid inequalities used in [21] and [24] for strengthening the Peng-Wei bound can also be exploited to tighten our relaxation. Finally, we point out that, in contrast to our relaxation, the Peng-Wei SDP has access to neither the cluster sizes nor the assignment matrix. We will discuss how to exploit the SDP solution to recover feasible clustering solutions in Sect. 7.
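A matching cvxpy sketch of (ML-SDP) (same hedging as above: names and tooling are ours) makes the size difference tangible, with a single \((n+k)\)-dimensional PSD constraint instead of k constraints of size \(n+1\):

```python
import cvxpy as cp
import numpy as np

def ml_sdp_bound(W, cards):
    """Sketch of the matrix lifting relaxation (ML-SDP)."""
    n, k = W.shape[0], len(cards)
    c = np.asarray(cards, dtype=float)
    C = np.diag(c)
    X = cp.Variable((n, k), nonneg=True)
    Z = cp.Variable((n, n), symmetric=True)
    cons = [X @ np.ones(k) == 1,               # X 1_k = 1_n
            X.T @ np.ones(n) == c,             # X^T 1_n = diag(C)
            Z @ np.ones(n) == 1,               # Z 1_n = 1_n
            cp.diag(Z) == X @ (1.0 / c),       # diag(Z) = X diag(C^{-1})
            Z >= 0,                            # elementwise nonnegativity
            cp.bmat([[C, X.T], [X, Z]]) >> 0]  # Schur: Z >= X C^{-1} X^T
    prob = cp.Problem(cp.Minimize(cp.trace(W) - cp.trace(W @ Z)), cons)
    prob.solve()
    return prob.value
```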

One special case of interest is the balanced ccMSSC where each cluster contains the same number of data points, i.e., \(c_j = \frac{n}{k}\) for all \(j \in [k]\). This leads to the Amini–Levina SDP relaxation, which is presented in [44] and takes the form

$$\begin{aligned} \begin{aligned} f^\star _\text {AL} = \min _{} \quad&\text {tr}(W - W Z)\\ \text {s.t.} \quad&Z 1_n=1_n, \ \text {diag}(Z)=\frac{k}{n}1_n, \ Z \in {\mathcal {S}}^n_+, \ Z \ge 0_{n \times n}. \end{aligned} \end{aligned}$$
(12)

Our (ML-SDP) generalizes the Amini–Levina SDP to the case of unbalanced cluster sizes, and it can be shown to be equivalent to problem (12) when the cardinalities of all clusters are the same. This equivalence is formalized in the following proposition.

Proposition 1

Let \(C = \text {Diag}(\frac{n}{k}, \ldots , \frac{n}{k})\), then Problems (ML-SDP) and (12) are equivalent.

Proof

We show that any feasible solution of problem (ML-SDP) gives rise to a feasible solution of problem (12) with the same objective value and vice versa. Let \((X, Z)\) be feasible for (ML-SDP); then it immediately follows that \({\bar{Z}} = Z\) is feasible for (12) and achieves the same objective value. Conversely, let Z be feasible for problem (12); then we construct a solution \(({\bar{X}}, {\bar{Z}})\) of problem (ML-SDP) with \({\bar{Z}} = Z\) and \({\bar{X}} = \frac{1}{k} 1_n 1_k^\top \). Then, we have \({\bar{X}} 1_k = 1_n\), \({\bar{X}}^\top 1_n = \text {diag}(C)\) and \(\text {diag}({\bar{Z}}) = {\bar{X}} \text {diag}(C^{-1}) = {\bar{X}} \frac{k}{n} 1_k = \frac{k}{n} 1_n\). Let \(M = {\bar{Z}} - {\bar{X}}C^{-1}{\bar{X}}^\top = {\bar{Z}} - \frac{1}{n} 1_n 1_n^\top \). It remains to show that \( M \succeq 0\). From \({\bar{Z}}1_n = 1_n\) we have that 1 is an eigenvalue of \({\bar{Z}}\) with corresponding eigenvector \(1_n\). The positive semidefiniteness of M follows from \({{\bar{Z}}}\) being positive semidefinite and \(M={\bar{Z}} - \frac{1}{n} 1_n 1_n^\top \) being a deflation of \({{\bar{Z}}}\succeq 0\) with respect to the eigenvalue-normalized eigenvector pair \(\left( 1,\frac{1}{\sqrt{n}}1_n\right) \). \(\square \)

We now prove that our relaxation (ML-SDP) is dominated by (VL-SDP), the best-known relaxation of the ccMSSC problem. Then, we compare the SDPs numerically on some instances from the literature. We show that the bound provided by problem (ML-SDP) is competitive and the computational effort to compute it is much smaller than the one required to solve problem (VL-SDP).

Proposition 2

We have \(f^\star _{\text {VL}} \ge f^\star _{\text {ML}} \ge f^\star _{\text {PW}}\).

Proof

We first show \(f^\star _{\text {VL}} \ge f^\star _{\text {ML}}\). For any feasible solution \(\{(\pi _j, \Pi _j)\}_{j=1}^k\) of problem (VL-SDP), we can construct a solution \((X, Z)\) with

$$\begin{aligned} Z = \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \quad \text {and} \quad X = [\pi _1, \ldots , \pi _k], \end{aligned}$$
(13)

that is feasible for problem (ML-SDP) and achieves the same objective value. By construction we have \(X \ge 0_{n \times k}\) and \(Z \ge 0_{n \times n}\), since \(\pi _j\) is nonnegative due to \(\Pi _j \ge 0_{n \times n}\) and \(\text {diag}(\Pi _j) = \pi _j\) for \(j=1,\ldots ,k\). Similarly, \(X 1_k = 1_n\) and \( X^\top 1_n = \text {diag}(C)\) hold by construction. Then, we have

$$\begin{aligned} Z 1_n&= \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j 1_n = \sum _{j=1}^{k} \frac{1}{c_j} c_j \pi _j = \sum _{j=1}^{k} \pi _j = 1_n, \end{aligned}$$
(14)
$$\begin{aligned} \text {diag}(Z)&= \text {diag}\Bigg (\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j \Bigg ) = \sum _{j=1}^{k} \frac{1}{c_j} \text {diag}(\Pi _j) = \sum _{j=1}^{k} \frac{1}{c_j} \pi _j = X \text {diag}(C^{-1}). \end{aligned}$$
(15)

Using the positive semidefiniteness of \(\Pi _j - \pi _j (\pi _j)^\top \) we have

$$\begin{aligned} Z - X C^{-1} X^\top&= \sum _{j=1}^k \frac{1}{c_j} \Pi _j - [\pi _1, \ldots , \pi _k] C^{-1} [\pi _1, \ldots , \pi _k]^\top \end{aligned}$$
(16)
$$\begin{aligned}&= \sum _{j=1}^k \frac{1}{c_j} \Pi _j - \sum _{j=1}^k \frac{1}{c_j} \pi _j (\pi _j)^\top = \sum _{j=1}^k \frac{1}{c_j} (\Pi _j - \pi _j (\pi _j)^\top ) \succeq 0. \end{aligned}$$
(17)

Clearly, the objective function values coincide. Finally, we have to show that \(f^\star _{\text {ML}} \ge f^\star _{\text {PW}}\). For any feasible solution \((X, Z)\) of problem (ML-SDP) we obtain a feasible solution \({\bar{Z}}\) of problem (4) by setting \({\bar{Z}} = Z\). Clearly, \({\bar{Z}} \ge 0_{n \times n}\) and \({\bar{Z}}1_n = 1_n\) hold by construction. Then, \({\bar{Z}} \succeq 0\) since \(Z \succeq X C^{-1} X^\top \succeq 0\). Furthermore, constraints \(\text {diag}(Z) = X \text {diag}(C^{-1})\) and \(X^\top 1_n = \text {diag}(C)\) imply

$$\begin{aligned} \text {tr}({\bar{Z}}) = 1_n^\top \text {diag}(Z) = 1_n^\top X \text {diag}(C^{-1}) = \text {diag}(C)^\top \text {diag}(C^{-1}) = k \end{aligned}$$
(18)

and hence \({\bar{Z}}\) is feasible for problem (4). \(\square \)
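The construction (13) is easy to sanity-check numerically: starting from an integral VL-feasible point (\(\Pi _j = \pi _j (\pi _j)^\top \) for a true partition), the assembled pair satisfies the (ML-SDP) constraints, in particular (14), (15) and (17). A hedged sketch with synthetic data:

```python
import numpy as np

cards = [4, 3, 2]
n, k = sum(cards), len(cards)
labels = np.repeat(np.arange(k), cards)            # a hard assignment
pis = [(labels == j).astype(float) for j in range(k)]
Pis = [np.outer(p, p) for p in pis]                # integral VL solution

Z = sum(P / c for P, c in zip(Pis, cards))         # construction (13)
X = np.column_stack(pis)

assert np.allclose(Z @ np.ones(n), 1)                      # eq. (14)
assert np.allclose(np.diag(Z), X @ (1 / np.array(cards)))  # eq. (15)
M = Z - X @ np.diag(1 / np.array(cards)) @ X.T             # eq. (17)
assert np.linalg.eigvalsh(M).min() >= -1e-12               # PSD check
```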

To compare the performance of the two relaxations, we select small-scale benchmark instances from the constrained clustering literature [4, 31] that are described in Sect. 8 (see Table 2). We solve both SDP relaxations using the MOSEK interior-point optimizer [45]. We set a default tolerance of \(10^{-8}\) for primal feasibility, dual feasibility, and relative gap. In Table 1, along with the number of data points n and the target number of clusters k, we report the lower bound (LB) provided by the relaxation (VL-SDP), the relaxation (ML-SDP), and the computational time in seconds.

Table 1 Comparison between VL-SDP and ML-SDP bounds on small-scale instances using MOSEK. The cluster sizes are set according to the ground-truth class labels

The results show that our matrix lifting SDP relaxation provides competitive bounds and is solved significantly faster than the existing vector lifting SDP relaxation. Moreover, on some instances the two relaxations provide bounds that are close, and for some problems even equal. Given this quality of the new relaxation, we believe that it is suitable for implementation within a branch-and-cut framework. Since SDP solvers can be inaccurate, we describe next how to derive valid lower bounds for both SDPs so that they can be safely used within a branch-and-cut framework.

4 Valid lower bounds

Current successful solution techniques based on B&B methods rely on obtaining strong and inexpensive bounds. The computational results in Table 1 show that, regardless of the relaxation, off-the-shelf interior-point methods (IPMs) are inefficient even for small-sized clustering instances. Thus, for an efficient B&B we need algorithms that can solve large-scale SDPs. Compared to IPMs [46], first-order SDP solvers based on the augmented Lagrangian method (ALM) or the alternating direction method of multipliers (ADMM) can scale to significantly larger problem sizes, while trading off the accuracy of the resulting output [47,48,49]. When using first-order methods, it is hard to reach a solution of high precision in a reasonable amount of time. To safely use the proposed lower bounds within the B&B algorithm, we need a certificate that the optimal value of the primal SDP is indeed a valid lower bound for the discrete optimization problem. However, since we only approximately solve the primal-dual pair of SDPs to some precision, feasibility is not necessarily reached when the algorithm terminates. In the following, we consider post-processing methods to guarantee safe lower bounds for our SDPs. Following the ideas developed in [50, 51], we use two methods: one adds a negative perturbation to the dual objective function value (error bounds), and the other generates a dual feasible solution, and hence a bound, by solving a linear program. Both methods are computationally cheap and produce bounds close to the optimal objective function value. However, since in our experiments neither method systematically outperforms the other on all the considered instances, we run both and use the best (largest) bound.

Consider Lagrange multipliers \(y \in {\mathbb {R}}^n\), \(\alpha _j, v_j \in {\mathbb {R}}\), \(\beta _j, \gamma _j, u_j \in {\mathbb {R}}^n\), \(V_j \in {\mathcal {S}}^{n}\), \(V_j \ge 0\), \(U_j \in {\mathcal {S}}^{n+1} \ \forall j \in [k]\). The dual of problem (VL-SDP) is

$$\begin{aligned}&\max \quad y^\top 1_n + \sum _{j=1}^k \alpha _j c_j - \sum _{j=1}^{k} v_j + \left\langle W,I_n\right\rangle \nonumber \\&\text {s.t.} \quad -y - \alpha _j 1_n + \beta _j + c_j \gamma _j - 2 u_j = 0_n \quad \forall j \in [k],\nonumber \\&\qquad -\frac{1}{c_j}W - \text {Diag}(\beta _j) - \frac{1}{2} 1_n (\gamma _j)^\top - \frac{1}{2} \gamma _j 1_n^\top - S_j = V_j \quad \forall j \in [k],\\&\qquad U_j = \begin{bmatrix} v_j &{} (u_j)^\top \\ u_j &{} S_j \end{bmatrix} \in {\mathcal {S}}_{+}^{n+1}, \ V_j \ge 0_{n \times n} \quad \forall j \in [k].\nonumber \end{aligned}$$
(19)

Consider Lagrange multipliers \(y_1, \alpha _1, \alpha _2 \in {\mathbb {R}}^n\), \(y_2 \in {\mathbb {R}}^k\), \(U \in {\mathbb {R}}^{n \times k}\), \(U \ge 0\), \(V \in {\mathcal {S}}^n\), \(V \ge 0\), \(S \in {\mathcal {S}}^{n+k}\). The dual of problem (ML-SDP) is

$$\begin{aligned} \begin{aligned}&\max \quad y_1^\top 1_n + y_2^\top \text {diag}(C) + \alpha _1^\top 1_n - \left\langle C,S_{11}\right\rangle + \left\langle W,I_n\right\rangle \\&\text {s.t.} \quad - y_1 1_k^\top - 1_n y_2^\top + \alpha _2 \text {diag}(C^{-1})^\top - 2S_{12} = U,\\&\qquad -W - \frac{1}{2}\alpha _1 1_n^\top - \frac{1}{2} 1_n \alpha _1^\top - \text {Diag}(\alpha _2) - S_{22} = V,\\&\qquad S = \begin{bmatrix} S_{11} &{}\quad S_{12}^\top \\ S_{12} &{}\quad S_{22} \end{bmatrix} \in {\mathcal {S}}^{n+k}_+, \ U \ge 0_{n \times k}, \ V \ge 0_{n \times n}. \end{aligned} \end{aligned}$$
(20)

We use the dual problems in both post-processing techniques described in the next subsections. The idea of post-processing via error bounds is the following. If dual feasibility is reached within machine accuracy, then the dual objective function value is already a valid lower bound. Otherwise, the dual objective value is perturbed by adding a negative term that accounts for the infeasibility. This perturbation should be as small as possible. Theorems 3 and 5 compute the tailored safe underestimate of the dual objective function for problems (VL-SDP) and (ML-SDP), respectively. The following lemma is needed for proving the validity of the error bounds.

Lemma 1

(Lemma 3.1 in ref. [50]) Let \(S, X \in {\mathcal {S}}^n\) be matrices that satisfy \(0 \le \lambda _{\min }(X)\) and \(\lambda _{\max }(X) \le {\bar{x}}\) for some \({\bar{x}} \in {\mathbb {R}}\). Then the following inequality holds:

$$\begin{aligned} \left\langle S,X\right\rangle \ge {\bar{x}}\sum _{i :\lambda _i(S) <0}\lambda _i(S). \end{aligned}$$

Theorem 3

Let \(p^*\) be the optimal objective function value of (VL-SDP). Given the dual variables \(y \in {\mathbb {R}}^n\), \(\alpha _j, v_j \in {\mathbb {R}}\), \(\beta _j, \gamma _j \in {\mathbb {R}}^n\), \(V_j \in {\mathcal {S}}^{n}\), \(V_j \ge 0\), set

$$\begin{aligned} U_j = \begin{bmatrix} v_j &{}\quad (u_j)^\top \\ u_j &{}\quad S_j \end{bmatrix}, \quad u_j&= \frac{1}{2}\left( -y - \alpha _j 1_n + \beta _j + c_j \gamma _j\right) , \\ S_j&= -\frac{1}{c_j}W - \text {Diag}(\beta _j) - \frac{1}{2} 1_n (\gamma _j)^\top - \frac{1}{2} \gamma _j 1_n^\top - V_j, \end{aligned}$$

for all \(j \in [k]\). A safe lower bound for \(p^\star \) is given by

$$\begin{aligned} lb = {\left\langle W,I_n\right\rangle } + y^\top 1_n + \sum _{j=1}^k \alpha _j c_j - \sum _{j=1}^{k} v_j + \sum _{j=1}^{k} \left( \left( c_j+1\right) \sum _{i:\lambda _i(U_j) < 0} \lambda _i(U_j) \right) . \end{aligned}$$

Proof

Let \(Y_j^\star = \begin{bmatrix} 1 &{} (\pi _j^\star )^\top \\ \pi _j^\star &{} \Pi _j^\star \end{bmatrix}\) for all \(j \in [k]\) be an optimal solution of (VL-SDP) with objective function value \(p^\star \). In order to show that \(p^\star \ge lb\), consider the equation

$$\begin{aligned}&{\left\langle W,I_n - \sum _{j=1}^k \frac{1}{c_j} \Pi _j^\star \right\rangle } - \left( y^\top 1_n + \sum _{j=1}^k \alpha _j c_j + \sum _{j=1}^k (\beta _j)^\top 0_n + \sum _{j=1}^k (\gamma _j)^\top 0_n - \sum _{j=1}^{k} v_j + {\left\langle W,I_n\right\rangle }\right) \\&\quad = {-\left\langle W,\sum _{j=1}^k \frac{1}{c_j} \Pi _j^\star \right\rangle } - y^\top \left( \sum _{j=1}^k \pi _j^\star \right) - \sum _{j=1}^k \alpha _j (1_n^\top \pi _j^\star ) - \sum _{j=1}^k (\beta _j)^\top (\text {diag}(\Pi _j^\star ) - \pi _j^\star ) \\&\qquad - \sum _{j=1}^k (\gamma _j)^\top \left( \frac{1}{2}(\Pi _j^\star + (\Pi _j^\star )^\top ) 1_n - c_j \pi _j^\star \right) + \sum _{j=1}^{k} v_j \\&\quad = {\sum _{j=1}^k -\frac{1}{c_j} \left\langle W,\Pi _j^\star \right\rangle }- \sum _{j=1}^k y^\top \pi _j^\star - \sum _{j=1}^k \alpha _j 1_n^\top \pi _j^\star - \left\langle \sum _{j=1}^k \text {Diag}(\beta _j),\Pi _j^\star \right\rangle + \sum _{j=1}^k (\beta _j)^\top \pi _j^\star \\&\qquad - \left\langle \frac{1}{2} \sum _{j=1}^k 1_n (\gamma _j)^\top ,\Pi _j^\star \right\rangle - \left\langle \frac{1}{2} \sum _{j=1}^k \gamma _j 1_n^\top ,\Pi _j^\star \right\rangle + \sum _{j=1}^k (\gamma _j)^\top c_j \pi _j^\star + \sum _{j=1}^{k} v_j \\&\quad = \sum _{j=1}^k \left\langle {-\frac{1}{ c_j} W }- \text {Diag}(\beta _j) - \frac{1}{2} 1_n (\gamma _j)^\top - \frac{1}{2} \gamma _j 1_n^\top ,\Pi _j^\star \right\rangle \\&\qquad + \sum _{j=1}^k \left( - y - \alpha _j 1_n + \beta _j + \gamma _j c_j\right) ^\top \pi _j^\star + \sum _{j=1}^k v_j \\&\quad = \sum _{j=1}^k \left\langle V_j,\Pi _j^\star \right\rangle + \sum _{j=1}^k \left\langle S_j,\Pi _j^\star \right\rangle + 2 \sum _{j=1}^k (u_j)^\top \pi _j^\star + \sum _{j=1}^k v_j = \sum _{j=1}^k \left\langle V_j,\Pi _j^\star \right\rangle + \sum _{j=1}^k \left\langle U_j,Y^\star _j\right\rangle . \end{aligned}$$

The last term can be bounded by means of Lemma 1 applied to the matrix \(Y^\star _j\), for which we need an upper bound \({\bar{y}}_j\) on \(\lambda _{\max }(Y^\star _j)\). The largest eigenvalue of \(\Pi _j\) is less than or equal to its largest row sum, which is the cardinality of cluster j. Due to \(\Pi _j 1_n = c_j \pi _j\) and \(\pi _j \in [0, 1]^n\), we set \({\bar{y}}_j = c_j + 1\). Therefore, using Lemma 1 with \({\bar{y}}_j = c_j+1\) and the nonnegativity of \(V_j\) for all \(j \in [k]\), we obtain

$$\begin{aligned} \sum _{j=1}^k \left\langle V_j,\Pi _j^\star \right\rangle + \sum _{j=1}^k \left\langle U_j,Y^\star _j\right\rangle \ge \sum _{j=1}^k \left( \left( c_j+1\right) \sum _{i:\lambda _i(U_j) < 0} \lambda _i(U_j) \right) . \end{aligned}$$

\(\square \)
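Computationally, Theorem 3 reduces to assembling each \(U_j\) from the approximate dual iterate and adding \((c_j+1)\) times the sum of its negative eigenvalues. A NumPy sketch (variable names are ours; the dual iterate would come from the first-order solver):

```python
import numpy as np

def safe_lb_vl(W, cards, y, alpha, v, beta, gamma, V):
    """Safe lower bound of Theorem 3 from an approximate dual iterate.
    y: n-vector; alpha, v: k-vectors; beta, gamma: lists of n-vectors;
    V: list of nonnegative n x n matrices."""
    n = W.shape[0]
    ones = np.ones(n)
    lb = np.trace(W) + y @ ones + np.dot(alpha, cards) - np.sum(v)
    for j, c in enumerate(cards):
        u = 0.5 * (-y - alpha[j] * ones + beta[j] + c * gamma[j])
        S = (-W / c - np.diag(beta[j])
             - 0.5 * np.outer(ones, gamma[j])
             - 0.5 * np.outer(gamma[j], ones) - V[j])
        U = np.block([[np.array([[v[j]]]), u[None, :]],
                      [u[:, None], S]])
        eig = np.linalg.eigvalsh(U)
        lb += (c + 1) * eig[eig < 0].sum()   # negative-eigenvalue term
    return lb
```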

In order to state an analogous result for problem (ML-SDP), we need an intermediate result, bounding the eigenvalues of its feasible solutions.

Lemma 2

(ref. [52]) For every block matrix \(M = {\begin{bmatrix} M_{11} &{} M_{12}^\top \\ M_{12} &{} M_{22} \end{bmatrix}} \succeq 0\) we have the decomposition

$$\begin{aligned} {M = U \begin{bmatrix} M_{11} &{} 0 \\ 0 &{} 0 \end{bmatrix} U^\top + V \begin{bmatrix} 0 &{} 0 \\ 0 &{} M_{22} \end{bmatrix} V^\top } \end{aligned}$$

for some orthogonal matrices U, V.

Theorem 4

Let \(Y = \begin{bmatrix} C &{} X^\top \\ X &{} Z \end{bmatrix}\) be a feasible solution of problem (ML-SDP), then

$$\begin{aligned} \max _{i=1,\ldots , k} C_{ii} \le \lambda _{\max }(Y) \le \max _{i=1,\ldots , k} C_{ii} + 1 \end{aligned}$$

Proof

Let \(C_0 = \begin{bmatrix} C &{} 0 \\ 0 &{} 0 \end{bmatrix}\) and \(Z_0 = \begin{bmatrix} 0 &{} 0 \\ 0 &{} Z \end{bmatrix}\). From Lemma 2 we have

$$\begin{aligned} Y = U C_0 U^\top + V Z_0 V^\top , \end{aligned}$$

where U and V are orthogonal. Therefore

$$\begin{aligned} \lambda _{\max }(Y) = \max _{\Vert v\Vert =1} v^\top Y v&= \max _{\Vert v\Vert =1} v^\top \big (U C_0 U^\top + V Z_0 V^\top \big ) v \\&\le \max _{\Vert v\Vert =1} v^\top \big (U C_0 U^\top \big ) v + \max _{\Vert v\Vert =1} v^\top \big ( V Z_0 V^\top \big ) v \\&= \lambda _{\max }\big ( U C_0 U^\top \big ) + \lambda _{\max }\big ( V Z_0 V^\top \big ) \\&= \lambda _{\max }\big (C_0 \big ) + \lambda _{\max }\big ( Z_0 \big ) \\&= \lambda _{\max }(C) + \lambda _{\max }(Z) = \max _{i=1,\ldots ,k} C_{ii} + 1, \end{aligned}$$

where the last two equations derive from the orthogonality of U and V and Z being a stochastic matrix. Finally,

$$\begin{aligned} \lambda _{\max }(Y) = \max _{\Vert v\Vert =1} v^\top Y v \ge e_i^\top Y e_i = Y_{ii} \quad \forall i \in \{1, \ldots , n+k\}, \end{aligned}$$

where \(e_i\) is the i-th basis vector. Therefore, \(\lambda _{\max }(Y) \ge \max _{i=1,\ldots ,k} C_{ii}\). \(\square \)
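The upper bound of Theorem 4 is tight at integral feasible points: if \(Z = XC^{-1}X^\top \) for a partition matrix X, then \(Y = VV^\top \) with \(V = [C^{1/2}; XC^{-1/2}]\), so the nonzero eigenvalues of Y are those of \(V^\top V = C + I_k\) and \(\lambda _{\max }(Y) = \max _i C_{ii} + 1\). A quick numerical check (a sketch, not from the paper):

```python
import numpy as np

cards = [5, 3, 2]
n, k = sum(cards), len(cards)
labels = np.repeat(np.arange(k), cards)
X = np.eye(k)[labels]                       # partition (assignment) matrix
C = np.diag(np.asarray(cards, dtype=float))
Z = X @ np.linalg.inv(C) @ X.T              # integral feasible Z
Y = np.block([[C, X.T], [X, Z]])

lam_max = np.linalg.eigvalsh(Y).max()
assert max(cards) <= lam_max <= max(cards) + 1 + 1e-9   # Theorem 4
print(lam_max)   # equals max(cards) + 1 at integral points
```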

Similarly to the vector lifting relaxation, once the SDP has been solved approximately, the following theorem gives a lower bound on the optimal value of the matrix lifting SDP relaxation.

Theorem 5

Let \(p^\star \) be the optimal objective function value of problem (ML-SDP). Given the dual variables \(y_1, \alpha _1, \alpha _2 \in {\mathbb {R}}^n\), \(y_2 \in {\mathbb {R}}^k\), \(U \in {\mathbb {R}}^{n \times k}\), \(U \ge 0\), \(V \in {\mathcal {S}}^n\), \(V \ge 0\), \(S_{11} \in {\mathbb {R}}^{k \times k}\), set

$$\begin{aligned} S = \begin{bmatrix} S_{11} &{} S_{12}^\top \\ S_{12} &{} S_{22} \end{bmatrix}, \quad S_{12}&= \frac{1}{2}\left( - y_1 1_k^\top - 1_n y_2^\top + \alpha _2 \text {diag}(C^{-1})^\top - U \right) , \\ S_{22}&= -W - \frac{1}{2}\alpha _1 1_n^\top - \frac{1}{2} 1_n \alpha _1^\top - \text {Diag}(\alpha _2) - V. \end{aligned}$$

A safe lower bound for the optimal value \(p^\star \) is given by

$$\begin{aligned} lb= & {} {\left\langle W,I_n\right\rangle } + y_1^\top 1_n + y_2^\top \text {diag}(C) + \alpha _1^\top 1_n - \left\langle C,S_{11}\right\rangle \\{} & {} + \left( \max _{i=1, \ldots , k} C_{ii} + 1 \right) \sum _{i:\lambda _i(S) < 0} \lambda _i(S). \end{aligned}$$

Proof

Let \(Y^\star = \begin{bmatrix} C &{} (X^\star )^\top \\ X^\star &{} Z^\star \end{bmatrix}\) be an optimal solution of problem (ML-SDP) with objective function value \(p^\star \). In order to show that \(p^\star \ge lb\), consider the equation

$$\begin{aligned}&{-\left\langle W,Z^\star \right\rangle + \left\langle W,I_n\right\rangle } - \big ( y_1^\top 1_n + y_2^\top \text {diag}(C) + \alpha _1^\top 1_n + \alpha _2^\top 0_n - \left\langle C,S_{11}\right\rangle + {\left\langle W,I_n\right\rangle } \big ) \\&\quad = {-\left\langle W,Z^\star \right\rangle } - y_1^\top X^\star 1_k - y_2^\top (X^\star )^\top 1_n - \frac{1}{2}\alpha _1^\top (Z^\star + (Z^\star )^\top ) 1_n \\&\qquad - \alpha _2^\top (\text {diag}(Z^\star ) - X^\star \text {diag}(C^{-1})) + \left\langle C,S_{11}\right\rangle \\&\quad = {-\left\langle W,Z^\star \right\rangle } - \left\langle y_1 1_k^\top ,X^\star \right\rangle - \left\langle 1_n y_2^\top ,X^\star \right\rangle - \frac{1}{2}\left\langle 1_n \alpha _1^\top ,Z^\star \right\rangle \\&\qquad - \frac{1}{2} \left\langle \alpha _1 1_n^\top ,Z^\star \right\rangle - \left\langle \text {Diag}(\alpha _2),Z^\star \right\rangle + \left\langle \alpha _2 \text {diag}(C^{-1})^\top ,X^\star \right\rangle +\left\langle C,S_{11}\right\rangle \\&\quad = \left\langle {-W} - \frac{1}{2} 1_n \alpha _1^\top - \frac{1}{2} \alpha _1 1_n^\top - \text {Diag}(\alpha _2),Z^\star \right\rangle \\&\qquad + \left\langle -y_1 1_k^\top - 1_n y_2^\top + \alpha _2 \text {diag}(C^{-1})^\top ,X^\star \right\rangle + \left\langle C,S_{11}\right\rangle \\&\quad = \left\langle V,Z^\star \right\rangle + \left\langle U,X^\star \right\rangle + \left\langle S_{11},C\right\rangle + 2\left\langle S_{12},X^\star \right\rangle + \left\langle S_{22},Z^\star \right\rangle \\&\quad = \left\langle V,Z^\star \right\rangle + \left\langle U,X^\star \right\rangle + \left\langle S,Y^\star \right\rangle . \end{aligned}$$

The last term can be bounded by using Lemma 1 applied on \(Y^\star \), with \({\bar{y}} \ge \lambda _{\max }(Y^\star )\). Here, a suitable bound \({\bar{y}}\) for the maximum eigenvalue of any feasible solution Y can be obtained by applying Theorem 4. Therefore, we set \({\bar{y}} = \max _{i=1, \ldots , k} C_{ii} + 1\), and applying Lemma 1 we get

$$\begin{aligned}&\left\langle V,Z^\star \right\rangle + \left\langle U,X^\star \right\rangle + \left\langle S,Y^\star \right\rangle \\&\quad \ge \left\langle V,Z^\star \right\rangle + \left\langle U,X^\star \right\rangle + \left( \max _{i=1, \ldots , k} C_{ii} + 1\right) \sum _{i:\lambda _i(S)< 0} \lambda _i(S) \\&\quad \ge \left( \max _{i=1, \ldots , k} C_{ii} + 1\right) \sum _{i:\lambda _i(S) < 0} \lambda _i(S), \end{aligned}$$

where the last inequality holds because U and V are nonnegative. \(\square \)

Another way to get valid lower bounds for problems (VL-SDP) and (ML-SDP) is to tune the output so as to obtain a feasible solution of the dual SDP. In more detail, from an approximate dual solution that is not positive semidefinite, a partial dual feasible solution is obtained by computing its projection onto the positive semidefinite cone. Then, the remaining dual variables are computed by solving an auxiliary LP. If the LP is feasible, we can construct a feasible dual solution, obtaining a valid dual bound. If it is infeasible, then we can construct neither a feasible dual solution nor a dual bound. In this case, we set \(lb = - \infty \) and use the valid bound provided by Theorems 3 and 5, respectively.
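The first step of this second method, projecting the approximate dual slack matrix onto the positive semidefinite cone, is standard eigenvalue clipping; the auxiliary LP for the remaining multipliers is problem-specific and not reproduced here. A sketch of the projection:

```python
import numpy as np

def project_psd(S):
    """Nearest (Frobenius) PSD matrix: clip negative eigenvalues to zero."""
    lam, Q = np.linalg.eigh((S + S.T) / 2)   # symmetrize, then decompose
    return (Q * np.clip(lam, 0, None)) @ Q.T
```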

5 Cutting-plane algorithm

The bounds given by relaxations (VL-SDP) and (ML-SDP) are not strong enough to be successfully used within a B&B framework to solve large ccMSSC problems to optimality. We propose to strengthen the SDP bounds by using polyhedral cuts and to solve the resulting SDPs by means of a cutting-plane algorithm. As mentioned in [4], trivial inequalities for the vector lifting relaxation are obtained via the reformulation–linearization technique (RLT). In [4], all RLT inequalities are included at once. We did some experiments adding all the RLT constraints (for small instances) or adding only the violated ones in a cutting-plane fashion. However, the improvement of the bound was marginal and came at a high computational cost. This motivated us to look for more effective valid inequalities. To this end, one may add any inequality that is valid for the so-called Boolean quadric polytope, which is defined in [53] as the convex hull of pairs \((\Pi _j, \pi _j)\) satisfying \(\Pi _j = \pi _j(\pi _j)^\top \) with \(\pi _j \in \{0, 1\}^n\) for \(j \in [k]\). One important class of these cuts are the triangle inequalities, defined as

$$\begin{aligned} \begin{aligned} (\pi _j)_r + (\pi _j)_s + (\pi _j)_t&\le (\Pi _j)_{rs} + (\Pi _j)_{rt} + (\Pi _j)_{st} + 1 \quad \forall j \in [k],\\ (\Pi _j)_{rs} + (\Pi _j)_{rt}&\le (\pi _j)_r + (\Pi _j)_{st} \quad \forall j \in [k] \end{aligned} \end{aligned}$$
(21)

which hold for all distinct triplets (r, s, t). Note that there are \(O(k \binom{n}{3})\) inequalities of type (21). Triangle inequalities capture the following property of the clustering problem: if data points \(p_r\), \(p_s\) and \(p_r\), \(p_t\) belong to cluster j, then \(p_s\) and \(p_t\) must be in cluster j as well. For the matrix lifting SDP, we consider triangle inequalities of the form

$$\begin{aligned} \begin{aligned} Z_{ij}&\le Z_{ii},\\ Z_{ij} + Z_{ih}&\le Z_{ii} + Z_{jh} \end{aligned} \end{aligned}$$
(22)

which hold for all distinct triplets (i, j, h). Constraints (22) are also used to strengthen the LP relaxation in [21] and the SDP relaxation in [24] for the unconstrained MSSC. Note that there are \(O(\binom{n}{3})\) inequalities of type (22). Similarly to (21), these inequalities ensure that if \(p_i\), \(p_j\) and \(p_i\), \(p_h\) are in the same cluster, then \(Z_{ii} = Z_{ij} = Z_{ih} = Z_{jh}\) must hold.

The enumeration of all triangle inequalities is computationally inexpensive, even for large instances. However, adding all of them would make the relaxation intractable even for moderate values of n. To keep both the SDPs for the bounding routine and the auxiliary LPs for post-processing at a modest size, we limit the number of triangle inequalities added at each cutting-plane iteration. The triangle inequalities are sorted by violation magnitude and added starting with the most violated ones. Once the SDP minimizer is obtained, we remove all inactive constraints and add newly violated inequalities. Next, the problem with the updated set of inequalities is solved, and the process is iterated as long as the increase of the lower bound is sufficiently large. Specifically, we terminate the bounding routine if the relative difference between consecutive bounds is less than a fixed threshold. If the gap cannot be closed after some cutting-plane iterations, we terminate the bounding routine, branch on the current node, and start solving new subproblems as described in Sect. 6. Note that, to improve the efficiency of the overall B&B algorithm, we pass cutting planes from parent to children nodes. This saves a significant number of cutting-plane iterations, and therefore computational time, when processing children nodes. We also point out that, whenever an SDP is solved with a new set of inequalities, the post-processing phase yielding valid lower bounds is performed by adapting the results in Sect. 4.
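A minimal separation routine for the cuts (22) could look as follows (a sketch; the per-iteration cap and the violation tolerance are parameters of our choosing, not values from the paper):

```python
import numpy as np
from itertools import combinations

def most_violated_triangles(Z, max_cuts=1000, tol=1e-4):
    """Enumerate the triangle inequalities (22) on Z and return the
    most violated ones as (violation, (i, j, h)) tuples."""
    n, cuts = Z.shape[0], []
    for i, j, h in combinations(range(n), 3):
        # each point of the triplet in turn plays the role of i in (22)
        for a, b, c in ((i, j, h), (j, i, h), (h, i, j)):
            viol = Z[a, b] + Z[a, c] - Z[a, a] - Z[b, c]
            if viol > tol:
                cuts.append((viol, (a, b, c)))
    cuts.sort(key=lambda t: -t[0])   # most violated first
    return cuts[:max_cuts]
```

The first family \(Z_{ij} \le Z_{ii}\) involves only \(O(n^2)\) pairs and can be screened separately with a single vectorized comparison.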

6 Branching

Using cutting planes to tighten the SDP relaxations may not be sufficient to prove the optimality of the best clustering solution found so far. If there are no violated cuts, or the bound does not improve significantly when adding valid inequalities, then we stop the generation of cutting planes and we branch, i.e., we split the current problem into problems of smaller dimension by fixing some variables. An appropriate branching strategy can help limit the branching decisions and prevent unnecessary bound computations. In the notation of problem (ccMSSC), one may select a variable \((\pi _j)_i\) having a nonbinary value and produce two subproblems: one with \((\pi _j)_i = 1\) (i.e., the i-th data point is assigned to cluster j) and the other with \((\pi _j)_i = 0\) (i.e., the i-th data point is not in cluster j). If a subproblem contains additional cluster indicator variables with nonbinary values, this process can be continued to produce a branching tree that will eventually enforce all binary conditions. Although many branching rules perform well for generic problems, there also exist branching rules tailored to the specific MSSC problem. Following [27], we adopt a branching logic aimed at enforcing pairwise constraints, namely must-link and cannot-link constraints. In more detail, a must-link constraint specifies that two data points must be in the same cluster, whereas a cannot-link constraint states that they cannot be placed together. Both types of constraints can be added to the SDP relaxations through linear equality and inequality constraints. To this end, let \([l_{=}] \subset [n] \times [n]\) and \([l_{\ne }] \subset [n] \times [n]\) be the set of must-link and cannot-link constraints, respectively. Consider the vector lifting relaxation and define the following sets

$$\begin{aligned} \mathcal {ML_{\text {VL}}}&= \left\{ (\pi , \Pi ) \in {\mathbb {R}}^n \times {\mathcal {S}}^n : \forall (i, j) \in [l_{=}] \ \begin{array}{l} \pi _i = \pi _j, \ \Pi _{it} = \Pi _{jt} \ \forall t \in [n] \end{array}\right\} , \\ \mathcal {CL_{\text {VL}}}&= \left\{ (\pi , \Pi ) \in {\mathbb {R}}^n \times {\mathcal {S}}^n : \forall (i, j) \in [l_{\ne }] \ \begin{array}{l} \pi _i + \pi _j \le 1, \ \Pi _{ij} = 0 \end{array}\right\} . \end{aligned}$$

At any level of the search tree, the vector lifting relaxation with pairwise constraints becomes

$$\begin{aligned}&\min \quad \left\langle W,I_n-\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j\right\rangle \nonumber \\&\text {s.t.} \quad \sum _{j=1}^{k} \pi _j = 1_n,\\&\qquad (\pi _j, \Pi _j) \in {\mathcal {S}}_{\text {VL}}(c_j) \cap \mathcal{M}\mathcal{L}_{\text {VL}} \cap \mathcal{C}\mathcal{L}_{\text {VL}} \quad \forall j \in [k].\nonumber \end{aligned}$$
(23)

Similarly, consider the matrix lifting relaxation and define the analogues sets

$$\begin{aligned} \mathcal {ML_{\text {ML}}}&= \left\{ (X, Z) \in {\mathbb {R}}^{n \times k} \times {\mathcal {S}}^n : \forall (i, j) \in [l_{=}] \ \begin{array}{l} X_{ih} = X_{jh} \ \forall h \in [k], \ Z_{ih} = Z_{jh} \ \forall h \in [n] \end{array}\right\} ,\\ \mathcal {CL_{\text {ML}}}&= \left\{ (X, Z) \in {\mathbb {R}}^{n \times k} \times {\mathcal {S}}^n : \forall (i, j) \in [l_{\ne }] \ \begin{array}{l} X_{ih} + X_{jh} \le 1 \ \forall h \in [k], \ Z_{ij} = 0 \end{array}\right\} . \end{aligned}$$

Therefore, at any level of the tree, the problem takes the form

$$\begin{aligned}&\min \quad \text {tr}\left( W - W Z \right) \nonumber \\&\text {s.t.} \quad (X, Z) \in {\mathcal {S}}_{\text {ML}} \cap \mathcal{M}\mathcal{L}_{\text {ML}} \cap \mathcal{C}\mathcal{L}_{\text {ML}}. \end{aligned}$$
(24)

Pairwise constraints partition the set of feasible solutions associated to the parent node into two disjoint subsets. Hence, our algorithm generates a binary enumeration tree: every time a parent node is split into two children, a pair of indices \((i^\star , j^\star )\) is chosen and a must-link and a cannot-link constraint are imposed on their associated problems. Depending on the relaxation, we use the following rules for selecting the branching pair:

$$\begin{aligned} (i^\star , j^\star )_{\text {VL}}&= \mathop {{\mathrm{arg~max}}}\limits _{i, j=1, \ldots , n} \left\{ \min _{h = 1, \ldots , k} \left\{ (\Pi _h)_{ij}, \Vert (\Pi _h)_{i,:}-(\Pi _h)_{j,:} \Vert _2^2 \right\} \right\} ,\\ (i^\star , j^\star )_{\text {ML}}&= \mathop {{\mathrm{arg~max}}}\limits _{i, j = 1, \ldots , n} \left\{ \min \left\{ Z_{ij}, \Vert Z_{i,:}-Z_{j,:} \Vert _2^2 \right\} \right\} . \end{aligned}$$

In other words, we choose indices \(i^\star \) and \(j^\star \) with the least tendency to assign data points \(p_{{i}^\star }\) and \(p_{{j}^\star }\) to the same cluster, or to different ones.
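In code, the \((i^\star , j^\star )_{\text {ML}}\) rule is a direct double loop over pairs (a sketch; the VL rule is analogous, with an extra minimum over the k blocks):

```python
import numpy as np

def select_branching_pair(Z):
    """Branching pair for (ML-SDP): the pair maximizing
    min( Z_ij , ||Z_{i,:} - Z_{j,:}||_2^2 )."""
    n = Z.shape[0]
    best, pair = -np.inf, None
    for i in range(n):
        for j in range(i + 1, n):
            score = min(Z[i, j], ((Z[i] - Z[j]) ** 2).sum())
            if score > best:
                best, pair = score, (i, j)
    return pair
```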

Remark 1

A branching rule based on pairwise constraints is relevant in our context in three different respects. First, it allows us to constrain multiple variables involving two data points at the same time. This yields tighter relaxations in the children compared to the strategy of enforcing integrality on the variable of a single data point. Second, it allows us to exploit symmetry reduction techniques that can significantly reduce the size of the SDPs. Third, it leads to the natural extension of SDP relaxations for ccMSSC to a general class of MSSC problems with both pairwise and cardinality constraints. To the best of our knowledge, this is the first time that tractable conic formulations are proposed for the MSSC problem with side constraints [15].

The number of pairwise constraints grows as the B&B tree gets deeper. As a result, the bounding problems may become expensive to solve. However, must-link constraints can be exploited to reduce the size of the subproblems. To this end, by introducing a suitable transformation matrix that maps the data points onto so-called “super points”, problems (23) and (24) can be reformulated as programs over lower dimensional positive semidefinite cones. In the next section, we show that the cardinality constraints can be mapped onto the set of super points.

6.1 Size reduction

We consider the undirected graph \(G = (V, E)\) whose vertices are the data points and whose edges are the must-link constraints, i.e., \(V = [n]\) and \(E = [l_{=}]\). We find the connected components of G and assume that there are \(m \le n\) components \([B_1], \ldots , [B_m]\). Let \(T^s\) be the \(m \times n\) transformation matrix with \(T^s_{ij} = 1\) if \(j \in [B_i]\) and 0 otherwise, for each \(i \in \{1, \ldots , m\}\). Matrix \(T^s\) encodes which data points belong to the same connected component, and the vector \(e^s = T^s 1_n\) contains the number of data points in each component. Observe that \(T^s W (T^s)^\top \) is a reduced-size version of W. As a result, instead of clustering the initial set of data points \(p_1, \ldots , p_n\), we find a k-partition of the super points \(s_1, \ldots , s_m\). In addition to removing must-link constraints, we are able to reduce the number of cannot-link constraints. To see this, we denote by \([l^s_{\ne }]\) the set of cannot-link constraints between super points. A cannot-link is defined on two super points \(s_i\) and \(s_j\) if there exists a cannot-link on two data points p and q such that \(p \in [B_i]\) and \(q \in [B_j]\). It is easy to verify that, after this process, multiple cannot-link constraints between data points lying in two given connected components are mapped to a single cannot-link on the corresponding pair of super points. A construction of \(T^s\) is sketched below.
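The following minimal sketch (assuming must-link pairs are given as a list of 0-based index pairs) builds \(T^s\), \(e^s\), and the reduced matrix \(T^s W (T^s)^\top \) from the connected components of the must-link graph:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def super_points(n, must_links, W):
    """Build T^s from the must-link graph and return it together with
    the component sizes e^s and the reduced matrix T^s W (T^s)^T."""
    rows = [i for i, j in must_links] + [j for i, j in must_links]
    cols = [j for i, j in must_links] + [i for i, j in must_links]
    G = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    m, labels = connected_components(G, directed=False)
    Ts = np.zeros((m, n))
    Ts[labels, np.arange(n)] = 1.0   # T^s_{ij} = 1 iff point j lies in component i
    es = Ts @ np.ones(n)             # e^s: number of data points per super point
    return Ts, es, Ts @ W @ Ts.T
```

Hence, we can reduce the size of problem (23) at any level of the tree by reformulating it as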

$$\begin{aligned}
&\hbox {min} \quad \text {tr}\left( W - T^{s} W (T^{s})^\top \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j^s \right) &&\qquad (25\text {a})\\
&\hbox {s.t.} \quad \sum _{j=1}^{k} \pi ^s_j = 1_m &&\qquad (25\text {b})\\
&\qquad (e^s)^\top \pi ^s_j = c_j &\forall j \in [k] &\qquad (25\text {c})\\
&\qquad (\pi ^s_h)_i + (\pi ^s_h)_j \le 1 &\forall h \in [k], \ \forall (i, j) \in [l^s_{\ne }] &\qquad (25\text {d})\\
&\qquad \begin{bmatrix} 1 & (\pi ^s_j)^\top \\ \pi ^s_j & \Pi ^s_j \end{bmatrix} \in {\mathcal {S}}_{+}^{m+1} &\forall j \in [k] &\qquad (25\text {e})\\
&\qquad (\Pi ^s_h)_{ij} = 0 &\forall h \in [k], \ \forall (i, j) \in [l^s_{\ne }] &\qquad (25\text {f})\\
&\qquad \text {diag}(\Pi ^s_j) = \pi ^s_j &\forall j \in [k] &\qquad (25\text {g})\\
&\qquad \Pi ^s_j e^s = c_j \pi ^s_j &\forall j \in [k] &\qquad (25\text {h})\\
&\qquad \Pi ^s_j \ge 0_{m \times m} &\forall j \in [k]. &\qquad (25\text {i})
\end{aligned}$$
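For illustration, problem (25) can be stated almost verbatim in a modeling language. The sketch below is hypothetical (it uses cvxpy; the paper's implementation solves these SDPs with SDPNAL+, see Sect. 8.1) and encodes the bordered PSD block (25e) as a single PSD variable per cluster:

```python
import cvxpy as cp
import numpy as np

def reduced_vl_sdp(Ws, es, c, cl_pairs, trW):
    """Sketch of relaxation (25) on m super points.
    Ws = T^s W (T^s)^T, es = e^s, c = (c_1,...,c_k),
    cl_pairs = [l^s_{!=}], trW = tr(W)."""
    m, k = Ws.shape[0], len(c)
    # One PSD block per cluster encodes (25e): Y_j = [[1, (pi^s_j)^T], [pi^s_j, Pi^s_j]].
    Y = [cp.Variable((m + 1, m + 1), PSD=True) for _ in range(k)]
    pi = [Y[j][0, 1:] for j in range(k)]
    Pi = [Y[j][1:, 1:] for j in range(k)]
    cons = [sum(pi) == np.ones(m)]                      # (25b)
    for j in range(k):
        cons += [Y[j][0, 0] == 1,
                 es @ pi[j] == c[j],                    # (25c)
                 cp.diag(Pi[j]) == pi[j],               # (25g)
                 Pi[j] @ es == c[j] * pi[j],            # (25h)
                 Pi[j] >= 0]                            # (25i)
        for a, b in cl_pairs:
            cons += [pi[j][a] + pi[j][b] <= 1,          # (25d)
                     Pi[j][a, b] == 0]                  # (25f)
    obj = cp.Minimize(trW - sum(cp.trace(Ws @ Pi[j]) / c[j] for j in range(k)))
    return cp.Problem(obj, cons)
```

Calling `.solve()` on the returned problem with any SDP-capable solver yields the lower bound of the current node under these assumptions.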

Theorem 6

Problems (23) and (25) are equivalent.

Proof

Let \(\{(\pi _j^s, \Pi _j^s)\}_{j=1}^k\) be a feasible solution of problem (25). For all \(j \in [k]\) define \(\Pi _j = (T^s)^\top \Pi _j^s T^s\) and \(\pi _j = (T^s)^\top \pi _j^s\). This is equivalent to expanding the matrices \(\{\Pi _j^s\}_{j=1}^{k}\) and vectors \(\{\pi _j^s\}_{j=1}^{k}\) by replicating their rows according to the indices of the data points involved in must-link constraints. Therefore, for all \((s, t) \in [l_{=}]\), the constraints \((\pi _j)_s = (\pi _j)_t\) and \((\Pi _j)_{sh} = (\Pi _j)_{th} \ \forall h \in [n]\) hold by construction. From the definition of \(T^s\), \(\Pi _j \ge 0_{n \times n}\) holds for all \(j \in [k]\). Likewise, from (25g) the constraints \(\text {diag}(\Pi _j) = \pi _j\) hold as well. Moreover, for all \(j \in [k]\) we have

$$\begin{aligned} \sum _{j=1}^{k} \pi _j&= \sum _{j=1}^{k} (T^s)^\top \pi _j^s = (T^s)^\top \sum _{j=1}^{k} \pi _j^s = (T^s)^\top 1_m = 1_n, \\ 1_n^\top \pi _j&= 1_n^\top (T^s)^\top \pi _j^s = (e^s)^\top \pi _j^s = c_j,\\ \Pi _j 1_n&= (T^s)^\top \Pi _j^s T^s 1_n = (T^s)^\top \Pi _j^s e^s = (T^s)^\top \pi _j^s c_j = c_j \pi _j, \end{aligned}$$

and

$$\begin{aligned} \Pi _j - \pi _j \pi _j^\top \succeq 0&\Leftrightarrow \left\langle \Pi _j u,u\right\rangle - \left\langle \pi _j \pi _j^\top u,u\right\rangle \\&= \left\langle (T^s)^\top \Pi _j^s T^s u,u\right\rangle - \left\langle (T^s)^\top \pi _j^s (\pi _j^s)^\top T^s u,u\right\rangle \\&= \left\langle \Pi _j^s {\bar{u}},{\bar{u}}\right\rangle - \left\langle \pi _j^s (\pi _j^s)^\top {\bar{u}},{\bar{u}}\right\rangle \ge 0 \quad \forall {\bar{u}} = T^s u, \ u \ne 0_n. \\&\Leftrightarrow \Pi _j^s - \pi _j^s(\pi _j^s)^\top \succeq 0. \end{aligned}$$

Finally, we have

$$\begin{aligned} \left\langle W,I_n-\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j\right\rangle &= \left\langle W,I_n -(T^s)^\top \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j^s T^s\right\rangle \\ &= \text {tr}\left( W - T^{s} W (T^{s})^\top \sum _{j=1}^{k} \frac{1}{c_j} \Pi _j^s \right) \end{aligned}$$

and thus \(\{(\Pi _j, \pi _j)\}_{j=1}^k\) is a feasible solution of problem (23) and the values of the objective functions coincide.

Now assume that \(\{(\Pi _j, \pi _j)\}_{j=1}^k\) is a feasible solution of problem (23). For all \(j \in [k]\) define \(\Pi _j^s = \text {Diag}(1_m / e^s) T^s \Pi _j (T^s)^\top \text {Diag}(1_m / e^s)\) and \(\pi _j^s = \text {Diag}(1_m / e^s) T^s \pi _j\). Constraints (25g) hold by construction since \(\Pi _j^s\) and \(\pi _j^s\) are lower dimensional variables obtained from \(\Pi _j\) and \(\pi _j\). If \(\Pi _j\) is nonnegative, then so is \(\Pi _j^s\). Furthermore, we have

$$\begin{aligned} \sum _{j=1}^k \pi _j^s&= \sum _{j=1}^k \text {Diag}(1_m / e^s) T^s \pi _j = \text {Diag}(1_m / e^s) T^s 1_n = \text {Diag}(1_m / e^s) e^s = 1_m, \\ (e^s)^\top \pi _j^s&= (e^s)^\top \text {Diag}(1_m / e^s) T^s \pi _j = \pi _j^\top (T^s)^\top \text {Diag}(1_m / e^s) e^s = \pi _j^\top 1_n = c_j,\\ \Pi _j^s e^s&= \text {Diag}(1_m / e^s) T^s \Pi _j (T^s)^\top \text {Diag}(1_m / e^s) e^s = \text {Diag}(1_m / e^s) T^s \Pi _j (T^s)^\top 1_m \\&= \text {Diag}(1_m / e^s) T^s \Pi _j 1_n = \text {Diag}(1_m / e^s) T^s \pi _j c_j = \pi _j^s c_j, \end{aligned}$$

and

$$\begin{aligned} \Pi _j^s - \pi _j^s (\pi _j^s)^\top \succeq 0&\Leftrightarrow \left\langle \Pi _j^s u,u\right\rangle - \left\langle \pi _j^s (\pi _j^s)^\top u,u\right\rangle \\&= \left\langle \text {Diag}(1_m / e^s) T^s \Pi _j (T^s)^\top \text {Diag}(1_m / e^s) u,u\right\rangle \\&\quad - \left\langle \text {Diag}(1_m / e^s) T^s \pi _j \pi _j^\top (T^s)^\top \text {Diag}(1_m / e^s) u,u\right\rangle \\&= \left\langle \Pi _j {\bar{u}},{\bar{u}}\right\rangle - \left\langle \pi _j \pi _j^\top {\bar{u}},{\bar{u}}\right\rangle \ge 0 \quad \forall {\bar{u}} = (T^s)^\top \text {Diag}(1_m / e^s) u\\&\Leftrightarrow \Pi _j - \pi _j \pi _j^\top \succeq 0. \end{aligned}$$

It remains to show that the objective function values coincide:

$$\begin{aligned}&\text {tr}(W) - \left\langle T^s W (T^s)^\top ,\sum _{j=1}^k \frac{1}{c_j} \Pi _j^s\right\rangle \\&\quad =\text {tr}(W) - \left\langle T^{s} W (T^{s})^\top ,\text {Diag}(1_m / e^s) T^{s} \sum _{j=1}^k \frac{1}{c_j} \Pi _j (T^{s})^\top \text {Diag}(1_m / e^s)\right\rangle \\&\quad =\text {tr}(W) - \left\langle W ,(T^{s})^\top \text {Diag}(1_m / e^s) T^{s}\sum _{j=1}^{k} \frac{1}{c_j}\Pi _j (T^{s})^\top \text {Diag}(1_m / e^s) T^{s}\right\rangle \\&\quad = \text {tr}(W) - \left\langle W,\sum _{j=1}^k \frac{1}{c_j} \Pi _j\right\rangle = \left\langle W,I_n-\sum _{j=1}^{k} \frac{1}{c_j} \Pi _j\right\rangle . \end{aligned}$$

Note that \((T^{s})^\top \text {Diag}(1_m / e^s) T^{s} \sum _{j=1}^k \frac{1}{c_j} \Pi _j (T^{s})^\top \text {Diag}(1_m / e^s) T^{s}\) “averages” over the rows of matrix \(\sum _{j=1}^k \frac{1}{c_j} \Pi _j\). Since these rows are identical due to must-link constraints, the last equation holds. \(\square \)
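As a small numerical sanity check of the expand/shrink maps used in this proof, the following sketch (with illustrative component labels and a random symmetric W) verifies both the round trip and the objective identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])   # component of each data point
Ts = np.zeros((m, n)); Ts[labels, np.arange(n)] = 1.0
es = Ts @ np.ones(n)
D = np.diag(1.0 / es)                          # Diag(1_m / e^s)
W = rng.random((n, n)); W = (W + W.T) / 2

# Expand a super-point matrix, then shrink it back: the round trip is exact
# because Ts @ Ts.T = Diag(e^s).
Pi_s = rng.random((m, m)); Pi_s = (Pi_s + Pi_s.T) / 2
Pi = Ts.T @ Pi_s @ Ts                          # replicate rows/columns per component
assert np.allclose(D @ Ts @ Pi @ Ts.T @ D, Pi_s)

# The objective values coincide: <W, Pi> = <T^s W (T^s)^T, Pi^s>.
assert np.isclose(np.sum(W * Pi), np.sum((Ts @ W @ Ts.T) * Pi_s))
```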

Similarly, we can reduce the size of problem (24) at any level of the tree by reformulating it as

$$\begin{aligned}
&\hbox {min} \quad \text {tr}\left( W - T^{s} W (T^{s})^\top Z^{s}\right) &&\qquad (26\text {a})\\
&\hbox {s.t.} \quad X^s 1_k = 1_m &&\qquad (26\text {b})\\
&\qquad (X^s)^\top e^s = \text {diag}(C) &&\qquad (26\text {c})\\
&\qquad X^s_{ih} + X^s_{jh} \le 1 &\forall h \in [k], \ \forall (i,j) \in [l^s_{\ne }] &\qquad (26\text {d})\\
&\qquad Z^s e^s = 1_m &&\qquad (26\text {e})\\
&\qquad \text {diag}(Z^s) = X^s \, \text {diag}(C^{-1}) &&\qquad (26\text {f})\\
&\qquad Z^s_{ij} = 0 &\forall (i,j) \in [l^s_{\ne }] &\qquad (26\text {g})\\
&\qquad \begin{bmatrix} C & (X^s)^\top \\ X^s & Z^s \end{bmatrix} \in {\mathcal {S}}^{m+k}_{+} &&\qquad (26\text {h})\\
&\qquad X^s \ge 0_{m \times k}, \ Z^s \ge 0_{m \times m}. &&\qquad (26\text {i})
\end{aligned}$$

Theorem 7

Problems (24) and (26) are equivalent.

Proof

Let \((Z^s, X^s)\) be a feasible solution of problem (26). Define \(Z = (T^s)^\top Z^s T^s\) and \(X = (T^s)^\top X^s\). This is equivalent to expanding the matrices \(Z^s\) and \(X^s\) by replicating their rows according to the indices of the data points involved in must-link constraints. Therefore, \(X_{ih} = X_{jh}\) for all \(h \in [k]\) and \(Z_{ih} = Z_{jh}\) for all \(h \in [n]\) hold by construction. Clearly, \(Z \ge 0_{n \times n}\) and \(X \ge 0_{n \times k}\) hold as well. Likewise, from \(\text {diag}(Z^s) = X^s \text {diag}(C^{-1})\) it is easy to verify that \(\text {diag}(Z) = X \text {diag}(C^{-1})\) holds. Moreover, we have that

$$\begin{aligned} X1_k&= (T^s)^\top X^s 1_k = (T^s)^\top 1_m = 1_n, \\ X^\top 1_n&= (X^s)^\top T^s 1_n = (X^s)^\top e^s = \text {diag}(C), \\ Z1_n&= (T^s)^\top Z^s T^s 1_n = (T^s)^\top Z^s e^s = (T^s)^\top 1_m = 1_n. \end{aligned}$$

By taking the Schur complement with respect to the block C, constraint (26h) can be rewritten as \(\left\langle Z^s v,v\right\rangle - \left\langle X^s C^{-1} (X^s)^\top v,v\right\rangle \ge 0\) for all \(v \ne 0_m\). Therefore, we have

$$\begin{aligned} Z - X C^{-1} X^\top \succeq 0&\Leftrightarrow \left\langle Z u,u\right\rangle - \left\langle X C^{-1} X^\top u,u\right\rangle \\&= \left\langle (T^s)^\top Z^s T^s u,u\right\rangle - \left\langle (T^s)^\top X^s C^{-1} (X^s)^\top T^s u,u\right\rangle \\&= \left\langle Z^s {\bar{u}},{\bar{u}}\right\rangle - \left\langle X^s C^{-1} (X^s)^\top {\bar{u}},{\bar{u}}\right\rangle \ge 0 \quad \forall {\bar{u}} = T^s u, \ u \ne 0_n. \\&\Leftrightarrow Z^s - X^s C^{-1} (X^s)^\top \succeq 0. \end{aligned}$$

Furthermore, \(\left\langle W,Z\right\rangle = \left\langle W,(T^s)^\top Z^s T^s\right\rangle = \left\langle T^s W (T^s)^\top ,Z^s\right\rangle \) and thus \((Z, X)\) is a feasible solution of problem (24) and the values of the objective functions coincide.

Now assume that \((Z, X)\) is a feasible solution of problem (24), and set \(X^s = \text {Diag}(1_m / e^s) T^s X\) and \(Z^s = \text {Diag}(1_m / e^s) T^s Z (T^s)^\top \text {Diag}(1_m / e^s)\). Constraint (26f) follows by construction since \(Z^s\) and \(X^s\) are the “shrunk” versions of Z and X. If \((Z, X)\) is nonnegative, then so is \((Z^s, X^s)\). Furthermore, we can derive

$$\begin{aligned} X^s 1_k&= \text {Diag}(1_m / e^s) T^s X 1_k = \text {Diag}(1_m / e^s) T^s 1_n = \text {Diag}(1_m / e^s) e^s = 1_m, \\ (X^s)^\top e^s&= X^\top (T^s)^\top \text {Diag}(1_m / e^s) e^s = X^\top (T^s)^\top 1_m = X^\top 1_n = \text {diag}(C), \\ Z^s e^s&= \text {Diag}(1_m / e^s) T^s Z (T^s)^\top \text {Diag}(1_m / e^s) e^s = \text {Diag}(1_m / e^s) T^s Z (T^s)^\top 1_m \\&= \text {Diag}(1_m / e^s) T^s Z 1_n = \text {Diag}(1_m / e^s) T^s 1_n = \text {Diag}(1_m / e^s) e^s = 1_m. \end{aligned}$$

Then, we have

$$\begin{aligned} Z^s - X^s C^{-1} (X^s)^\top \succeq 0&\Leftrightarrow \left\langle Z^s u,u\right\rangle - \left\langle X^s C^{-1} (X^s)^\top u,u\right\rangle \\&= \left\langle \text {Diag}(1_m / e^s) T^s Z (T^s)^\top \text {Diag}(1_m / e^s) u,u\right\rangle \\&\quad - \left\langle \text {Diag}(1_m / e^s) T^s X C^{-1} X^\top (T^s)^\top \text {Diag}(1_m / e^s) u,u\right\rangle \\&= \left\langle Z {\bar{u}},{\bar{u}}\right\rangle - \left\langle X C^{-1} X^\top {\bar{u}},{\bar{u}}\right\rangle \ge 0 \quad \forall {\bar{u}} = (T^s)^\top \text {Diag}(1_m / e^s) u\\&\Leftrightarrow Z - X C^{-1} X^\top \succeq 0. \end{aligned}$$

It remains to show that the objective function values coincide:

$$\begin{aligned} \left\langle T^s W (T^s)^\top ,Z^s\right\rangle&= \left\langle T^{s} W (T^{s})^\top ,\text {Diag}(1_m / e^s) T^{s} Z (T^{s})^\top \text {Diag}(1_m / e^s)\right\rangle \\&= \left\langle W ,(T^{s})^\top \text {Diag}(1_m / e^s) T^{s} Z (T^{s})^\top \text {Diag}(1_m / e^s) T^{s}\right\rangle = \left\langle W,Z\right\rangle . \end{aligned}$$

In the latter equation, note that \((T^{s})^\top \text {Diag}(1_m / e^s) T^{s} Z (T^{s})^\top \text {Diag}(1_m / e^s) T^{s}\) “averages” over selected rows of matrix Z. Since these rows are identical due to must-link constraints, the last equation holds. \(\square \)

7 Heuristic

In the previous sections we discussed how to obtain strong and inexpensive lower bounds for ccMSSC by using semidefinite programming tools. In this section, we develop a rounding algorithm that recovers a feasible clustering (and thus an upper bound on the ccMSSC problem) from the solution of the SDP relaxation. By far the most popular heuristic for solving the unconstrained MSSC problem is the k-means algorithm [18]. Given the initial cluster centers, k-means proceeds by alternating two steps until convergence. In the first step, each observation is assigned to the closest cluster center, whereas in the second step, each center is updated by taking the mean of all the data points assigned to it. The algorithm terminates when the centers no longer change. As with other local search heuristics, the quality of the clustering found by k-means depends heavily on the choice of the initial cluster centers [54, 55].

To generate high-quality solutions, we propose a two-phase rounding procedure. In the first phase, we extract the initial cluster centers from the solution of the SDP relaxation solved at each node. In this way, we exploit the quality of the SDP solution by solving a sequence of relaxations in which the underlying clusters become more clearly defined at each cutting-plane iteration. In the second phase, using integer programming tools, we design a local search procedure inspired by k-means that improves the initial clustering while satisfying the cardinality constraints. We run this procedure at each node and incorporate the pairwise constraints described in Sect. 6 whenever branching decisions are taken.

Consider the ccMSSC formulation and recall that \(\pi _j = X_{:,j}\) is the indicator vector of cluster j. In general, the solution \({\tilde{X}}\) of problem (ML-SDP) represents a soft assignment matrix since its elements are between 0 and 1. To build a soft assignment from a solution \(\{(\tilde{\Pi }_j, \tilde{\pi }_j)\}_{j=1}^k\) of problem (VL-SDP), we stack the cluster indicator variables as columns of the matrix \({\tilde{X}} = [\tilde{\pi }_1, \ldots , \tilde{\pi }_k]\). Recall that \({\mathcal {F}}\) is the set of all assignments with known cluster sizes as defined in equation (6). Furthermore, denote by \(\mathcal{M}\mathcal{L}\) and \(\mathcal{C}\mathcal{L}\) the sets of assignment variables satisfying must-link and cannot-link constraints, respectively. In order to obtain a feasible clustering, we find the closest assignment matrix \({\bar{X}}^0\) to \({\tilde{X}}\) with respect to the Frobenius norm by solving

$$\begin{aligned} \begin{aligned}&\min _{X} \left\{ \Vert X - {\tilde{X}} \Vert _F^2: X \in {\mathcal {F}} \cap \mathcal{M}\mathcal{L} \cap \mathcal{C}\mathcal{L} \right\} \\&\quad = \max _{X} \left\{ \left\langle X,{\tilde{X}}\right\rangle : X \in {\mathcal {F}} \cap \mathcal{M}\mathcal{L} \cap \mathcal{C}\mathcal{L} \right\} . \end{aligned} \end{aligned}$$
(27)

At the root node, that is when \([l_{=}] = [l_{\ne }] = \emptyset \), problem (27) can be solved in polynomial time because its constraint matrix is totally unimodular, implying that its LP relaxation is exact. When pairwise constraints are added at each node, total unimodularity no longer holds in general. However, in practice the problem can be quickly solved to global optimality by off-the-shelf integer programming solvers. The solution \({\bar{X}}^0\) is a feasible clustering, and thus the corresponding centroids \({\bar{m}}_j = \frac{1}{c_j} \sum _{i=1}^{n} {\bar{X}}^0_{ij} p_i\) for all \(j \in [k]\) can be used to initialize the local search procedure illustrated in Algorithm 1. A sketch of the root-node rounding step is given below.
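In the root-node case, problem (27) is a transportation problem, so a plain LP solver already returns an integral vertex. A minimal sketch follows; the function name and the use of scipy are illustrative (the paper uses Gurobi):

```python
import numpy as np
from scipy.optimize import linprog

def round_assignment(X_tilde, c):
    """Solve max <X, X_tilde> s.t. X 1_k = 1_n, X^T 1_n = c, X >= 0.
    Total unimodularity makes the LP optimum integral at the root node."""
    n, k = X_tilde.shape
    cost = -X_tilde.reshape(-1)            # x = vec(X), row-major; maximize <X, X_tilde>
    A_eq, b_eq = [], []
    for i in range(n):                     # each point belongs to exactly one cluster
        row = np.zeros(n * k); row[i * k:(i + 1) * k] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for j in range(k):                     # cluster j contains exactly c_j points
        row = np.zeros(n * k); row[j::k] = 1.0
        A_eq.append(row); b_eq.append(float(c[j]))
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0.0, 1.0), method="highs")
    return np.rint(res.x).reshape(n, k)    # 0/1 assignment matrix
```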

Note that, in principle, if the solver failed to solve problem (27), we could reuse the initialization found at the root node or any other random initialization. However, in all our experiments, off-the-shelf software always solved problem (27) in negligible time.

[Algorithm 1: two-step local search heuristic for ccMSSC (pseudocode figure)]

Steps 1 and 2 of Algorithm 1 are reminiscent of a single iteration of the k-means algorithm for unconstrained clustering. Specifically, Step 1 computes the optimal assignment of each point to the nearest cluster center while adhering to both cardinality and pairwise constraints. The feasible set of this ILP is the same as that of problem (27), and the same considerations hold regarding total unimodularity at the root node and the practical behaviour of off-the-shelf solvers. The minimization in Step 2, instead, admits a closed-form solution which, as in k-means, is given by the sample average of the data points assigned to each cluster; a sketch of the resulting loop is given below. As with the k-means algorithm, there is no guarantee that the iterates generated by our heuristic converge to the global minimizer. Unlike the heuristic proposed in [4], our algorithm finds the initial cluster centers by exploiting SDP relaxations strengthened through cutting planes, and improves the quality of the initial clustering by appending some iterations of a k-means-like method with both cardinality and pairwise constraints. Numerical experiments show that the proposed heuristic is able to find the optimal solution at the root or within the first few nodes.
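A minimal sketch of the two-step local search, reusing the hypothetical `round_assignment` helper above as a stand-in for the constrained assignment ILP (at nodes deeper in the tree, pairwise constraints would be added to that subproblem):

```python
import numpy as np

def constrained_kmeans(P, c, centers, max_iter=50):
    """Alternate (Step 1) cardinality-constrained assignment and
    (Step 2) centroid update, starting from the SDP-based centers."""
    c = np.asarray(c, dtype=float)
    for _ in range(max_iter):
        # Step 1: squared distances to the current centers are the assignment costs.
        D = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # n x k
        X = round_assignment(-D, c)            # max <X, -D> = min total cost
        new_centers = (X.T @ P) / c[:, None]   # Step 2: per-cluster sample means
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return X, centers
```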

The overall branch-and-cut algorithm is shown in Algorithm 2. It converges to a globally optimal solution within precision \(\varepsilon \), thanks to the binary branching that implicitly enumerates all possible cluster assignments. In practice, a small number of nodes is required thanks to the strength of the lower bounding routine and the effectiveness of the heuristic, as we show in the next section.

[Algorithm 2: SDP-based branch-and-cut algorithm for ccMSSC (pseudocode figure)]

8 Computational results

In this section, we describe the implementation details and we show numerical results on real-world instances.

8.1 Implementation details

Our B&B algorithm is implemented in C++ with some routines written in MATLAB. The SDP relaxations are solved with SDPNAL+, a MATLAB software package that implements an augmented Lagrangian method for semidefinite programming problems with bound constraints [56]. We set the accuracy tolerance of SDPNAL+ to \(10^{-4}\) in the relative KKT residual and post-process the output of the solver using the techniques described in Sect. 4. We use Gurobi [57] for the LP-based post-processing and for solving the integer problems in Algorithm 1. We run the experiments on a laptop with an Intel(R) i7-12700H CPU @ 3.50 GHz with 14 cores, 16 GB of RAM, and the Ubuntu operating system. To improve the efficiency of the B&B search, we concurrently explore multiple nodes of the tree. As for the cutting-plane procedure, at each iteration we separate at most 100,000 triangle inequalities, sort them in decreasing order of violation, and add the first 10%. In our numerical tests, the tolerance for checking violation is set to \(10^{-4}\). Furthermore, we use the same tolerance for removing inactive inequalities. We stop the cutting-plane procedure when there are no violated inequalities or the relative difference between consecutive lower bounds is less than \(10^{-4}\). Finally, we explore the B&B tree with the best-first search strategy. The source code is available at https://github.com/antoniosudoso/cc-sos-sdp. A sketch of the separation step is given below.
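For illustration, the separation routine can be sketched as follows. We assume triangle inequalities of the form \(Z_{ij} + Z_{ih} \le Z_{ii} + Z_{jh}\), the standard family for clustering SDPs; the exact cuts used are those defined in Sect. 5, and the thresholds mirror the settings above:

```python
import numpy as np
from itertools import combinations

def separate_triangles(Z, tol=1e-4, keep_frac=0.10, max_cuts=100_000):
    """Enumerate violated triangle inequalities Z_ij + Z_ih <= Z_ii + Z_jh,
    sort them by violation, and keep the most violated 10%."""
    n = Z.shape[0]
    viol = []
    for i in range(n):
        others = [t for t in range(n) if t != i]
        for j, h in combinations(others, 2):
            v = Z[i, j] + Z[i, h] - Z[i, i] - Z[j, h]
            if v > tol:
                viol.append((v, i, j, h))
    if not viol:
        return []
    viol.sort(key=lambda t: -t[0])        # most violated first
    viol = viol[:max_cuts]                # separate at most 100,000 cuts
    return viol[: max(1, int(keep_frac * len(viol)))]
```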

Table 2 Real-world instances
Table 3 Results of B&B VL-SDP and B&B ML-SDP on real-world instances. The “–” signs indicate that the problem instance could not be solved within the time limit of 12 h (43,200 s). Times are in seconds and all reported gaps are percentages

8.2 Results on real-world instances

In Table 2, we report 29 real-world datasets for classification problems with number of data points \(n \in [75, 1000]\), number of features \(d \in [2, 20531]\), number of clusters \(k \in [2, 9]\), and the optimal value \(f_\text {OPT}\) certified by our methodology. They can all be downloaded from the UCI and UCR websites. Following the related literature [4, 31], the number of clusters is assumed to be equal to the number of classes. Furthermore, we set the cluster cardinalities \(c_1, \ldots , c_k\) to the numbers of true class occurrences in each dataset. We implement two versions of the B&B algorithm whose only difference is the adopted SDP relaxation: B&B VL-SDP uses problem (VL-SDP), whereas B&B ML-SDP uses problem (ML-SDP). We add triangle inequalities in a cutting-plane fashion as described in Sect. 5. We require an optimality tolerance on the percentage gap of \(\varepsilon = 0.01\%\) for instances with \(n < 500\) and \(\varepsilon = 0.1\%\) for instances with \(n \ge 500\), i.e., we terminate each method when \(100\cdot \frac{UB - LB}{UB} \le \varepsilon \), where UB and LB denote the best upper and lower bounds, respectively. In all our experiments, we set a time limit of 12 h of computing time.

In Table 3, we compare the two versions of our B&B algorithm. We report the instance id and, for each method, some statistics on the root node, the total number of processed nodes, and the computational time in seconds. As for the root node, we report the percentage gap after solving the SDP without triangle inequalities (Gap\(_0\)), the number of cutting-plane iterations (CP), and the percentage gap at the end of the cutting-plane procedure (Gap\(_r\)). Whenever the time limit is reached, we set the time to “-” and report in brackets, in the “Nodes” column, the gap when the algorithm stops. It turns out that B&B ML-SDP solves all the instances to the required precision within the time limit. On the other hand, B&B VL-SDP outperforms B&B ML-SDP when n and k are small. We also stress that when B&B VL-SDP hits the time limit, the optimality gap is always smaller than \(2\%\). Looking at the table, the results confirm that the bound produced by the vector lifting relaxation is stronger (the number of nodes is always smaller than that of B&B ML-SDP) and computationally tractable for small n and k. When n and k increase, the efficiency of the matrix lifting relaxation makes it possible to process a larger number of nodes in a smaller amount of time. Note that the size reduction of the SDP described in Sect. 6 helps to limit the computational time when the number of nodes increases. Consequently, the influence of the size reduction is stronger for B&B ML-SDP, where the number of nodes is larger and hence branching decisions are more frequent.

Many instances are solved at the root node (especially with B&B VL-SDP), and this depends both on the strength of the lower bound (thanks to the cutting-plane procedure) and on the excellent upper bound produced by the heuristic. The statistics on the root node confirm that the vector lifting relaxation is stronger: in 8 of the 29 instances it is tight (no cutting-plane iterations are performed), and in general the number of cutting-plane iterations is much smaller than the one needed by the matrix lifting relaxation. However, at the end of the root node, the difference in gap between the two relaxations becomes almost negligible, thanks to the cutting-plane procedure. The results confirm that adding inequalities significantly reduces the gap and in many cases closes it at the root node: this happens on 19 of the 29 instances with B&B VL-SDP and on 13 of the 29 with B&B ML-SDP. In general, the gap at the end of the root node is always below 2% for B&B VL-SDP and 3% for B&B ML-SDP. When k increases, the gap increases, and finding the global minimum with the heuristic also becomes harder. Indeed, the only instances where the global minimum is not found at the root node are UrbanLand, Plane, EthanolLevel, AbnormalHeart, and UWaveGesture. A significant example is UrbanLand, where \(k=9\). This is the instance with the highest gap at the root node; the optimal solution is not found at the root, but after 16 and 85 nodes in B&B VL-SDP and B&B ML-SDP, respectively. To better understand the cutting-plane contribution at the root node, in Figs. 1 and 2 we plot, for B&B VL-SDP and B&B ML-SDP respectively, the relative gap versus the number of CP iterations whenever \(\text {CP} \ge 3\). The “x” marker indicates that the global upper bound has been updated at the corresponding CP iteration. These plots reveal two insights. First, the gap diminishes at a faster rate during the initial iterations. Second, the subsequent iterations are still valuable, as they lead to further gap reduction and frequently result in UB updates thanks to the improved SDP solution.

Although the method in [4] does not guarantee globally optimal solutions, its SDP bound and rounding heuristic are able to prove the optimality of Iris, Parkinsons, and Seeds: problem (VL-SDP) is tight for Iris, and one cutting-plane iteration is sufficient for solving Parkinsons and Seeds. Despite the relevance of the ccMSSC problem, the constraint programming solver in [31] is the only computational study on exact approaches that has appeared in the literature, and its code is available. Thus, we ran their code on our machine on all the small-scale instances, i.e., \(n\le 210\), and compared the performance of our B&B algorithms against it. The algorithm in [31] is able to find the globally optimal solution of BreastTissue, Iris, Parkinsons, Ruspini, and Wine in 14, 9, 29,397, 8, and 8224 s, respectively. Hence, the method is competitive with our approaches except on Wine and Parkinsons, where we show vastly improved computational times. For the remaining small-scale instances, we report the relative gap after the time limit of 12 h: Hierarchical 27.32%, HapticsSmall 56.01%, UrbanLand 34.73%, Connectionist 19.03%, Seeds 1.49%. Given the large gaps on these small-scale instances, we did not run their method on the large-scale ones.

Summarizing, we are able to solve for the first time instances approximately 10 times larger than those solved by previous exact approaches. The cutting-plane procedure is fundamental for closing the gap, and, combined with the effectiveness of the heuristic, it allows us to find the global minimum at the root node or within a few nodes. When the size is small, the vector lifting relaxation is preferable thanks to the strength of its bound; when n and k increase, the matrix lifting relaxation yields more tractable nodes in terms of computational time, and the cutting-plane procedure keeps its bound competitive with the vector lifting one.

Fig. 1 Gap versus cutting-plane iterations at the root node for B&B VL-SDP. The “x” marker indicates that the global upper bound has been updated at the corresponding iteration

Fig. 2 Gap versus cutting-plane iterations at the root node for B&B ML-SDP. The “x” marker indicates that the global upper bound has been updated at the corresponding iteration

9 Conclusions

In this paper, we proposed an SDP-based branch-and-cut algorithm for solving MSSC with strict cardinality constraints. For computing the lower bound, we used the SDP relaxation recently proposed in [4] for small-scale instances, whereas for large-scale ones we derived a new SDP relaxation. We implemented a cutting-plane algorithm to strengthen the bounds provided by both relaxations with polyhedral cuts. For the upper bound computation, we designed a constrained variant of the k-means algorithm and initialized it with the solution of the SDP relaxation solved at each node. Numerical results demonstrate the efficiency of our solver: when using the new SDP bound, we can solve real-world instances with up to 1000 data points, whereas when using the bound in [4] we can only solve small-size problems and guarantee an optimality gap smaller than 2% on larger instances. To the best of our knowledge, no other exact solution method can handle real-world instances of that size. An interesting research direction for improving our methodology is to study facial reduction for both the VL-SDP and ML-SDP relaxations, following [43]. This should make the solution of the SDPs faster and more robust, leading to a more efficient branch-and-cut algorithm. Furthermore, an immediate future research direction is to design global algorithms for clustering problems with fairness constraints. As introduced in [58], the goal of fair clustering is to find a partition in which all the clusters are balanced with respect to some protected attributes such as gender or religion.