1 Introduction

In many challenging real-world applications involving the grouping of high-dimensional data, points from each group (cluster) can be well approximated by a distinct lower dimensional linear subspace. This is the case in gene sequencing (McWilliams and Montana 2014), cancer genomics (Yeoh et al. 2002), face clustering (Elhamifar and Vidal 2013), motion segmentation (Rao et al. 2010), and text mining (Peng et al. 2018). The problem of simultaneously estimating the linear subspace corresponding to each cluster, and assigning each point to the closest subspace is known as subspace clustering (Vidal 2011). In the data mining literature, this problem has also been referred to as correlation clustering (Kriegel et al. 2009), but we refrain from using this terminology here. It is important to note that in data mining the term “subspace clustering” has been used to refer to a number of distinct high-dimensional clustering problems (Kriegel et al. 2009). A formal definition of the problem we consider as well as a brief overview of existing methods is provided in Sect. 2.

Spectral methods for subspace clustering have demonstrated excellent performance in numerous real-world applications (Liu et al. 2012; Elhamifar and Vidal 2013; Huang et al. 2015; Li et al. 2017). These methods construct an affinity matrix for spectral clustering by solving an optimisation problem that aims to approximate each point through a linear combination of other points from the same subspace. In this paper, we first propose a method called Weighted Sparse Simplex Representation (WSSR). Our method is based on the Sparse Simplex Representation (SSR) of Huang et al. (2013), in which each point is approximated through a convex combination of other points. SSR was not proposed for subspace clustering, but rather for modelling brain anatomical and genetic networks. We modify SSR to ensure that each point is approximated through a sparse convex combination of nearby neighbours, and thus obtain an algorithm that is effective for the subspace clustering problem.

Due to the complete lack of labelled data, clustering methods rarely achieve perfect performance. For this reason, in real-world applications, clustering models are rarely immediately accepted and acted upon. Instead, they undergo one or more rounds of “validation”, which commonly involves domain experts assessing whether the model is sensible. A generic description of the information generated during validation is to assume that domain experts assess whether the assignment of a (small) subset of the originally unlabelled data is sensible. To be accepted the clustering model needs to be consistent with this information. In this paper, we assume that this external (side) information can be translated to labels. To accommodate the existence of label information, we consider constrained subspace clustering (Basu et al. 2008). An even more interesting problem arises if the learning algorithm can select the points that experts consider during validation. An effective active learning strategy is highly beneficial not only because it minimises the cost of obtaining labels, but also because producing a valid clustering in as few iterations of the validation process as possible improves user confidence in the model. We therefore discuss an active learning strategy that is designed to query the labels of points so as to maximise the overall quality of the subspace clustering model. In particular, we draw on the work of Peng and Pavlidis (2019) to select informative points to label for subspace clustering. The label information is subsequently incorporated in a constrained clustering formulation that combines WSSR and constrained K-subspace clustering. The resulting cluster assignment is guaranteed to satisfy all the constraints arising from the set of labelled data.

The rest of this paper is organised as follows. In Sect. 2, we discuss some relevant existing literature in the areas of subspace clustering, constrained clustering, and active learning. In Sect. 3, we propose the problem formulation of WSSR, discuss its properties, and present an approach for solving the problem. We develop an integrated active learning and constrained clustering framework in Sect. 4. The experimental results are organised in three sections. First, in Sect. 5, we use synthetic data to evaluate the performance of WSSR under various configurations of the subspace clustering problem. Next, we consider real datasets to assess the comparative performance of our proposed method in the completely unsupervised setting (Sect. 6), as well as in the constrained clustering with active learning settings (Sect. 7). The paper ends with concluding remarks and future research directions in Sect. 8.

2 Related work

The linear subspace clustering problem can be defined as follows. A collection of N data points \({\mathcal {X}}= \left\{ {\varvec{x}}_{i} \right\} _{i=1}^{N} \subset {\mathbb {R}}^P\) is drawn from a union of \(K\) linear subspaces \(\left\{ {\mathcal {S}}_{k} \right\} _{k=1}^{K}\) with added noise. Each subspace can be defined as,

$$\begin{aligned} {\mathcal {S}}_{k}=\left\{ {\varvec{x}}\in {\mathbb {R}}^{P} \,|\, {\varvec{x}}= V_{k}{\varvec{y}} \right\} , \;\; \text {for} \;\; k=1, \ldots , K, \end{aligned}$$
(1)

where \(V_{k} \in {\mathbb {R}}^{P\times d_{k}}\), with \(1 \leqslant d_{k}<P\), is a matrix whose columns constitute a basis for \({\mathcal {S}}_{k}\), and \({\varvec{y}}\in {\mathbb {R}}^{d_{k}}\) is the representation of \({\varvec{x}}\) in terms of the columns of \(V_{k}\). The goal of subspace clustering is to find the number of subspaces \(K\); the subspace dimensions \(\left\{ d_{k} \right\} _{k=1}^{K}\); a basis \(\left\{ V_{k} \right\} _{k=1}^{K}\) for each subspace; and finally the assignments of the points in \({\mathcal {X}}\) to clusters. A natural formulation of this problem is as the minimisation of the reconstruction error,

$$\begin{aligned} \min _{V_1, \ldots , V_K} \sum _{i=1}^{N} \min _{k=1,\ldots ,K} \Vert {\varvec{x}}_i - V_{k} V_{k}^{\mathsf {T}}{\varvec{x}}_i\Vert _2^2 . \end{aligned}$$
(2)

K-subspace clustering (KSC) (Bradley and Mangasarian 2000) is an iterative algorithm for solving the problem in (2). Like classical K-means, KSC alternates between estimating the subspace bases (for a fixed cluster assignment) and assigning points to clusters (for a fixed set of bases). However, such iterative algorithms are very sensitive to initialisation and most often converge to poor local minima (Lipor and Balzano 2017).
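To make the alternating structure of KSC concrete, the following minimal Python sketch alternates between fitting a basis to each cluster and reassigning every point to the subspace with the smallest reconstruction error. It is an illustrative implementation under simplifying assumptions (a fixed, known subspace dimension d for all clusters, random initialisation, and bases taken as the top left singular vectors of the uncentred cluster data matrix, appropriate for linear subspaces through the origin), not the authors' code.

```python
import numpy as np

def ksc(X, K, d, n_iter=50, seed=0):
    """Minimal K-subspace clustering sketch. X is P x N; returns labels and bases."""
    rng = np.random.default_rng(seed)
    P, N = X.shape
    labels = rng.integers(K, size=N)             # random initial assignment
    for _ in range(n_iter):
        # Step 1: estimate a d-dimensional basis for each cluster (top left singular vectors).
        bases = []
        for k in range(K):
            Xk = X[:, labels == k]
            if Xk.shape[1] < d:                  # guard against (near-)empty clusters
                Xk = X[:, rng.integers(N, size=d)]
            U, _, _ = np.linalg.svd(Xk, full_matrices=False)
            bases.append(U[:, :d])
        # Step 2: reassign each point to the subspace with the smallest reconstruction error.
        errors = np.stack([np.sum((X - V @ (V.T @ X))**2, axis=0) for V in bases])
        new_labels = errors.argmin(axis=0)
        if np.array_equal(new_labels, labels):   # converged to a fixed point
            break
        labels = new_labels
    return labels, bases
```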

Currently the most effective approach to subspace clustering is through spectral-based methods (Lu et al. 2012; Elhamifar and Vidal 2013; Hu et al. 2014; You et al. 2016). Spectral-based methods consist of two steps: first an affinity matrix is estimated, and then normalised spectral clustering (Ng et al. 2002) is applied to this affinity matrix. The affinity matrix is constructed by exploiting the self-expressive property: any \({\varvec{x}}_{i} \in {\mathcal {S}}_k\) can be expressed as a linear combination of \(d_{k}\) other points from \({\mathcal {S}}_k\). Thus, for each \({\varvec{x}}_i \in {\mathcal {X}}\) they first solve a convex optimisation problem of the form,

$$\begin{aligned} {\varvec{\beta }}_{i}^\star = {{\,\mathrm{arg\,min}\,}}_{{\varvec{\beta }}_i \in {\mathbb {R}}^{N-1}} \left\| {\varvec{x}}_i - X_{-i} {\varvec{\beta }}_i \right\| _{p} + \rho \left\| {\varvec{\beta }}_i \right\| _{q}, \end{aligned}$$
(3)

where \(X_{-i} = \left[ {\varvec{x}}_{1},\ldots , {\varvec{x}}_{i-1},{\varvec{x}}_{i+1},\ldots ,{\varvec{x}}_{N} \right] \in {\mathbb {R}}^{P\times (N-1)}\) is a matrix whose columns correspond to the points in \({\mathcal {X}}\backslash \{{\varvec{x}}_i\}\); and \(\rho >0\) is a penalty parameter. The first term in the objective function quantifies the error of approximating \({\varvec{x}}_i\) through \(X_{-i} {\varvec{\beta }}_{i}\). The penalty (regularisation) term is included to promote solutions in which \(\beta _{ij}^\star \) is small (and ideally zero) if \({\varvec{x}}_{j}\) belongs to a different subspace than \({\varvec{x}}_{i}\). After solving the problem in (3) for each \({\varvec{x}}_i \in {\mathcal {X}}\), the affinity matrix is typically defined as \(A = \left( |B|+|B|^{\mathsf {T}}\right) /2\), where \(B = [{\varvec{\beta }}_1^\star , \ldots , {\varvec{\beta }}_N^\star ]\).
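The two-step pipeline can be sketched as follows. Assuming a per-point solver `coefficient_vector` for a problem of the form (3) (for instance the WSSR solver described in Sect. 3), the snippet below assembles the coefficient matrix B, symmetrises it into the affinity matrix A, and applies normalised spectral clustering. The function names and the use of scikit-learn's `SpectralClustering` are illustrative choices on our part, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_subspace_clustering(X, K, coefficient_vector):
    """X is P x N; coefficient_vector(x_i, X_minus_i) returns beta_i of length N-1."""
    P, N = X.shape
    B = np.zeros((N, N))
    for i in range(N):
        X_minus_i = np.delete(X, i, axis=1)
        beta = coefficient_vector(X[:, i], X_minus_i)
        B[np.arange(N) != i, i] = beta            # place coefficients back, skipping the diagonal
    A = (np.abs(B) + np.abs(B).T) / 2             # affinity matrix A = (|B| + |B|^T) / 2
    model = SpectralClustering(n_clusters=K, affinity='precomputed', assign_labels='kmeans')
    return model.fit_predict(A)
```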

Least Squares Regression (LSR) (Lu et al. 2012) uses the \(L_2\)-norm for both the approximation error, and the regularisation term (\(p=q=2\)). Smooth Representation Clustering (SMR) (Hu et al. 2014) also uses the \(L_2\)-norm on the approximation error, while the penalty term is given by \(\Vert L^{1/2}{\varvec{\beta }}_i\Vert _2^2\), in which \(L\) is a positive definite Laplacian matrix constructed from pairwise similarities. The main advantage of using the \(L_2\)-norm is that the optimisation problem has a closed-form solution. However, the resulting coefficient vectors are dense and hence the affinity matrix contains connections between points from different subspaces. The most prominent spectral-based subspace clustering algorithm is Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2013). In its most general formulation, SSC accommodates the possibility that points from each subspace are contaminated by both noise and sparse outlying entries. In SSC, \({\varvec{\beta }}_i^\star \) is the solution to the following problem,

$$\begin{aligned} \min _{{\varvec{\beta }}_i \in {\mathbb {R}}^{N-1}}&\; \Vert {\varvec{\beta }}_i\Vert _1 + \rho _\eta \Vert {\varvec{\eta }}_i\Vert _1 + \frac{\rho _z}{2} \Vert {\varvec{z}}_i\Vert _2^2,\\ \text {s.t.}&\; {\varvec{x}}_i = X_{-i} {\varvec{\beta }}_i + {\varvec{\eta }}_i + {\varvec{z}}_i. \nonumber \end{aligned}$$
(4)

SSC therefore decomposes the approximation error into two components (\({\varvec{\eta }}_i\) and \({\varvec{z}}_i\)), which are measured with different norms. Following the success of SSC, several variants have been proposed, including SSC with Orthogonal Matching Pursuit (SSC-OMP) (You et al. 2016), Structured Sparse Subspace Clustering (S3C) (Li et al. 2017), and Affine Sparse Subspace Clustering (ASSC) (Li et al. 2018a).

The method most closely connected to our approach is SSR, proposed by Huang et al. (2013) for the modelling of brain networks. SSR solves the problem in (3) using \(p=2\) and \(q=1\) with the additional constraint that the coefficient vector has to lie in the \((N-1)\)-dimensional unit simplex \({\varvec{\beta }}_i \in \Delta ^{N-1} = \{{\varvec{\beta }}\in {\mathbb {R}}^{N-1} \,|\, {\varvec{\beta }}\geqslant 0, {\varvec{\beta }}^{\mathsf {T}}{\varvec{1}}=1\}\). Since SSR approximates \({\varvec{x}}_i\) through a convex combination of other points, the coefficients have a probabilistic interpretation. However, SSR induces no regularisation since \(\Vert {\varvec{\beta }}_i\Vert _1=1\) for all \({\varvec{\beta }}_i \in \Delta ^{N-1}\), hence coefficient vectors are dense.

We next provide a short overview of clustering with external information, called constrained clustering (Basu et al. 2008), and active learning. Due to space limitations, we only mention the work that is most closely related to our problem. In constrained clustering, the external information can be either in the form of class labels or as pairwise “must-link” and “cannot-link” constraints. Spectral methods for constrained clustering incorporate this information by modifying the affinity matrix. Constrained Spectral Partitioning (CSP) (Wang and Davidson 2010) introduces a pairwise constraint matrix and solves a modified normalised cut spectral clustering problem. Partition Level Constrained Clustering (PLCC) (Liu et al. 2018) forms a pairwise constraint matrix through a side information matrix, which is included as a penalty term in the normalised cut objective. Constrained Structured Sparse Subspace Clustering (CS3C) (Li et al. 2017) is specifically designed for subspace clustering. CS3C incorporates a side information matrix that encodes the pairwise constraints into the formulation of S3C. The algorithm alternates between solving for the coefficient matrix and solving for the cluster labels. Constrained clustering algorithms that rely exclusively on modifying the affinity matrix cannot guarantee that all the constraints will be satisfied. CS3C+ (Li et al. 2018b) is an extension of CS3C that applies the constrained K-means algorithm (Wagstaff et al. 2001) within the spectral clustering stage to ensure that the constraints are satisfied.

In active learning, the algorithm controls the choice of points for which external information is obtained. The majority of active learning techniques are designed for supervised methods, and little research has considered the problem of active learning for subspace clustering (Lipor and Balzano 2015, 2017; Peng and Pavlidis 2019). Lipor and Balzano (2015) propose two active learning strategies. The first queries the point(s) with the largest reconstruction error with respect to its allocated subspace. The second queries the point(s) that are most nearly equidistant to their two closest subspaces. Lipor and Balzano (2017) extend the second strategy for spectral clustering by setting the affinity of “must-link” and “cannot-link” pairs of points to one and zero, respectively. Both strategies by Lipor and Balzano (2015) are effective in identifying mislabelled points. However, correctly assigning these points is not guaranteed to maximally improve the accuracy of the estimated subspaces, and hence the overall quality of the clustering. Peng and Pavlidis (2019) propose an active learning strategy for sequentially querying point(s) to maximise the decrease of the overall reconstruction error in (2).

3 Weighted sparse simplex representation

In this section, we describe the proposed spectral-based subspace clustering method, called Weighted Sparse Simplex Representation (WSSR). In this description, we assume that no labelled data is available. The constrained clustering version described in the next section accommodates the case of having a subset of labelled observations at the start of the learning process.

Let \(d_{ij} \geqslant 0\) denote a measure of dissimilarity between \({\varvec{x}}_i, {\varvec{x}}_j \in {\mathcal {X}}\), and \({\mathcal {I}}\) the set of indices of all points in \({\mathcal {X}}\backslash \{{\varvec{x}}_i\}\) with finite dissimilarity to \({\varvec{x}}_i\), that is \({\mathcal {I}}= \{1\leqslant j \leqslant N \,|\, d_{ij} < \infty , \; j \ne i \}\). In WSSR, the coefficient vector for each \({\varvec{x}}_{i}\in {\mathcal {X}}\) is the solution to the following convex optimisation problem,

$$\begin{aligned} {\varvec{\beta }}_i^\star =&{{\,\mathrm{arg\,min}\,}}_{{\varvec{\beta }}_{i}} \frac{1}{2} \left\| {\varvec{x}}_{i} - {\hat{X}}_{{\mathcal {I}}}{\varvec{\beta }}_{i} \right\| _{2}^{2} +\rho \left\| D_{\mathcal {I}}{\varvec{\beta }}_{i} \right\| _{1} +\frac{\xi }{2} \Vert D_{\mathcal {I}}{\varvec{\beta }}_{i} \Vert _{2}^{2} \\&\text {s.t.} \quad {\varvec{\beta }}_{i}^{{\mathsf {T}}}{\varvec{1}} = 1, \quad {\varvec{\beta }}_{i} \geqslant {\varvec{0}}, \nonumber \end{aligned}$$
(5)

where \(\rho , \xi >0\) are penalty parameters, \({\hat{X}}_{{\mathcal {I}}}\in {\mathbb {R}}^{P \times |{\mathcal {I}}|}\) is a matrix whose columns are the scaled versions of the points in \({\mathcal {X}}_{\mathcal {I}}\), and \(D_{\mathcal {I}}= \text {diag}({\varvec{d}}_{{\mathcal {I}}})\) is a diagonal matrix of finite pairwise dissimilarities between \({\varvec{x}}_i\) and the points in \({\mathcal {X}}_{\mathcal {I}}\). We first outline our motivation for the choice of penalty function, and then discuss the definition of \({\hat{X}}_{{\mathcal {I}}}\) and the choice of \(d_{ij}\) in the next paragraph. The use of both an \(L_{1}\) and an \(L_{2}\)-norm in (5) is motivated by the elastic net formulation (Zou and Hastie 2005). The \(L_{1}\)-norm penalty promotes solutions in which coefficients of dissimilar points are zero. The \(L_{2}\)-norm penalty encourages what is known as the grouping effect: if a group of points induces a similar approximation error, then either all points in the group are represented in \({\varvec{\beta }}_i^\star \), or none is. This is desirable for subspace clustering, because the points in such a group should belong to the same subspace. In this case, if this subspace is different from the one \({\varvec{x}}_i\) belongs to, then all points should be assigned a coefficient of zero. If instead the group of points are from the same subspace as \({\varvec{x}}_i\), then it is beneficial to connect \({\varvec{x}}_i\) to all of them, because this increases the probability that points from each subspace will belong to a single connected component of the graph defined by the affinity matrix A.

Spectral subspace clustering algorithms commonly normalise the points in \({\mathcal {X}}\) to have unit \(L_2\)-norm prior to estimating the coefficient vectors (Elhamifar and Vidal 2013; You et al. 2016). The simple example in Fig. 1 illustrates that this normalisation tends to increase cluster separability. In the following, we denote \(\bar{{\varvec{x}}}_{i} = {\varvec{x}}_i / \Vert {\varvec{x}}_i\Vert _2\). However, projecting the data onto the unit sphere has important implications for the WSSR problem. In (5) we want the two conflicting objectives of minimising the approximation error and selecting a few nearby points to be separate. Figure 2 contains an example that shows that this is not true after projecting onto the unit sphere. In Fig. 2, the point closest to \(\bar{{\varvec{x}}}_i\) on the unit sphere is \(\bar{{\varvec{x}}}_{1}\). The direction of \(\bar{{\varvec{x}}}_{i}\) can be perfectly approximated by a convex combination of \(\bar{{\varvec{x}}}_{1}\) and \(\bar{{\varvec{x}}}_{2}\), but all convex combinations \(\alpha \bar{{\varvec{x}}}_{1} + (1-\alpha ) \bar{{\varvec{x}}}_{2}\) with \(\alpha \in (0,1)\) have an \(L_2\)-norm less than one. Since the cardinality of \({\varvec{\beta }}_i\) affects the length of the approximation, it affects both the penalty term and the approximation error. A simple solution to address this problem is to scale every point \({\varvec{x}}_j\) with \(j \in {\mathcal {I}}\) such that \(\hat{{\varvec{x}}}_j^{i} = t^{i}_j {\varvec{x}}_j\) lies on the hyperplane perpendicular to the unit sphere at \(\bar{{\varvec{x}}}_i\), \(\hat{{\varvec{x}}}_j^{i} \in \{{\hat{{\varvec{x}}}} \in {\mathbb {R}}^P\;|\; {\hat{{\varvec{x}}}}^{\mathsf {T}}\bar{{\varvec{x}}}_i=1\}\). Note that this implies that if \({\varvec{x}}_j^{\mathsf {T}}{\varvec{x}}_i<0\), then \(t^{i}_j\) is negative. An inspection of Fig. 1 suggests that this is sensible. An appropriate measure of pairwise dissimilarity given the aforementioned preprocessing steps is the inverse absolute cosine similarity,

Fig. 1 An illustration of the data normalisation step, and the rationale for using the inverse cosine similarity as the dissimilarity measure. Left: the original data points. Right: the data points normalised to lie on the unit sphere

Fig. 2 A geometric illustration of the necessity for stretching points in X

$$\begin{aligned} d_{ij}= \Vert {\varvec{x}}_{i}\Vert _2 \Vert {\varvec{x}}_{j}\Vert _2 / |{\varvec{x}}_i^{{\mathsf {T}}} {\varvec{x}}_{j}| = |\bar{{\varvec{x}}}_{i}^{{\mathsf {T}}}\bar{{\varvec{x}}}_{j}|^{-1}. \end{aligned}$$
(6)

Since \(d_{ij}\) is infinite when \({\varvec{x}}_j^{\mathsf {T}}{\varvec{x}}_i = 0\), such points can never be assigned a non-zero coefficient and are therefore excluded from \({\hat{X}}_{{\mathcal {I}}}\).
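The preprocessing described above can be sketched as follows. For a given point, the snippet normalises \(x_i\) to the unit sphere, scales each remaining point onto the hyperplane tangent to the sphere at \(\bar{x}_i\), and computes the inverse absolute cosine dissimilarities of (6), excluding (near-)orthogonal points. The tolerance `tol` and the function names are illustrative assumptions.

```python
import numpy as np

def wssr_preprocess(x_i, X_others, tol=1e-10):
    """Return the scaled neighbour matrix X_hat, the dissimilarity vector d,
    and the indices of the retained neighbours (the set I)."""
    x_bar = x_i / np.linalg.norm(x_i)                       # project x_i onto the unit sphere
    cos = (X_others.T @ x_bar) / np.linalg.norm(X_others, axis=0)
    keep = np.abs(cos) > tol                                # exclude points with x_j^T x_i ~ 0
    Xk = X_others[:, keep]
    t = 1.0 / (Xk.T @ x_bar)                                # t_j^i such that (t_j^i x_j)^T x_bar = 1
    X_hat = Xk * t                                          # scaled points on the tangent hyperplane
    d = 1.0 / np.abs(cos[keep])                             # inverse absolute cosine similarity, Eq. (6)
    return X_hat, d, np.flatnonzero(keep)
```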

We now return to the optimisation problem in (5). Due to the constraint \({\varvec{\beta }}_i \in \Delta ^{|{\mathcal {I}}|}\) and the fact that \({\varvec{d}}_{{\mathcal {I}}} >0\), \(\Vert D_{\mathcal {I}} {\varvec{\beta }}_i\Vert _1 = {\varvec{d}}_{\mathcal {I}}^{\mathsf {T}}{\varvec{\beta }}_i\). This implies that the objective function is a quadratic, and the minimisation problem in (5) is equivalent to the one below,

$$\begin{aligned} \min _{{\varvec{\beta }}_{i}}&\frac{1}{2} {\varvec{\beta }}_i^{\mathsf {T}}( {\hat{X}}_{{\mathcal {I}}}^{\mathsf {T}}{\hat{X}}_{{\mathcal {I}}}+ \xi D_{\mathcal {I}}^2) {\varvec{\beta }}_{i} + (\rho {\varvec{d}}_{{\mathcal {I}}}- {\hat{X}}_{{\mathcal {I}}}^{\mathsf {T}}\hat{{\varvec{x}}}_i)^{\mathsf {T}}{\varvec{\beta }}_i, \;\; \text {s.t.} \;\; {\varvec{\beta }}_i \geqslant 0, \; \; {\varvec{\beta }}_i^{\mathsf {T}}{\varvec{1}} = 1. \end{aligned}$$
(7)

For any \(\xi >0\), the above objective function is guaranteed to be strictly convex, and therefore WSSR corresponds to a Quadratic Programme (QP).
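A direct way to solve (7) is to hand the QP to a generic convex solver. The sketch below uses cvxpy purely as an illustration (any QP solver would do), with `X_hat`, `x_bar`, and `d` as produced by the preprocessing sketch above, noting that the scaled version of \(x_i\) coincides with \(\bar{x}_i\); the default penalty values match those used later in the experiments.

```python
import numpy as np
import cvxpy as cp

def wssr_qp(x_bar, X_hat, d, rho=1e-2, xi=1e-4):
    """Solve problem (7): a QP over the unit simplex."""
    m = X_hat.shape[1]
    Q = X_hat.T @ X_hat + xi * np.diag(d**2)      # quadratic term  X^T X + xi D^2
    Q = 0.5 * (Q + Q.T)                           # symmetrise for numerical safety
    c = rho * d - X_hat.T @ x_bar                 # linear term  rho d - X^T x
    beta = cp.Variable(m)
    objective = cp.Minimize(0.5 * cp.quad_form(beta, Q) + c @ beta)
    constraints = [beta >= 0, cp.sum(beta) == 1]  # unit simplex constraint
    cp.Problem(objective, constraints).solve()
    return np.asarray(beta.value).ravel()
```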

The choice of \(\rho \) in (7) is critical to obtain an affinity matrix that accurately captures the cluster structure. The ridge penalty parameter, \(\xi \), is typically assigned a small value, e.g. \(10^{-4}\) (Gaines et al. 2018). For “large” \(\rho \), the optimal solution assigns a coefficient of one to the nearest neighbour of \(\bar{{\varvec{x}}}_i\) and zero to all other points. This is clearly undesirable. Setting this parameter is complicated by the fact that an appropriate choice of \(\rho \) differs for each \(\bar{{\varvec{x}}}_i\). Lemma 1 provides a lower bound that \(\rho \) must exceed for the solution of (7) to be the “nearest neighbour” approximation. The proof of this lemma, as well as a second lemma which establishes its geometric interpretation, can be found in Appendix B.

Lemma 1

Let (j) denote the index of the j-th nearest neighbour of \(\bar{{\varvec{x}}}_i\), and \(\hat{{\varvec{x}}}_{i}^{(j)}\) denote this neighbour. Assume that \(\hat{{\varvec{x}}}^{(1)}_{i}\) is unique and that \(\hat{{\varvec{x}}}^{(1)}_{i} \ne \bar{{\varvec{x}}}_i\). We also assume that the pairwise dissimilarities satisfy:

$$\begin{aligned} \Vert \bar{{\varvec{x}}}_i - \hat{{\varvec{x}}}^{(j)}_i \Vert _2> \Vert \bar{{\varvec{x}}}_i - \hat{{\varvec{x}}}^{(k)}_i \Vert _2&\Rightarrow d_{ij} > d_{ik},\\ \Vert \bar{{\varvec{x}}}_i - \hat{{\varvec{x}}}^{(j)}_i \Vert _2 = \Vert \bar{{\varvec{x}}}_i - \hat{{\varvec{x}}}^{(k)}_i \Vert _2&\Rightarrow d_{ij} = d_{ik}. \end{aligned}$$

If

$$\begin{aligned} {\varvec{e}}_{1}=[1,0,\ldots ,0 ]^{{\mathsf {T}}} = {{\,\mathrm{arg\,min}\,}}_{{\varvec{\beta }}_i \in \Delta ^{|{\mathcal {I}}|}} \; \frac{1}{2} {\varvec{\beta }}_i^{\mathsf {T}}( {\hat{X}}_{{\mathcal {I}}}^{\mathsf {T}}{\hat{X}}_{{\mathcal {I}}}+ \xi D_{\mathcal {I}}^{{\mathsf {T}}}D_{\mathcal {I}}) {\varvec{\beta }}_{i} + (\rho {\varvec{d}}_{{\mathcal {I}}} - {\hat{X}}_{{\mathcal {I}}}^{\mathsf {T}}\bar{{\varvec{x}}}_i)^{\mathsf {T}}{\varvec{\beta }}_i, \end{aligned}$$

then

$$\begin{aligned} \rho > \max \left\{ 0, \max _{j \in \left\{ 2,\ldots , |{\mathcal {I}}| \right\} } \frac{(\hat{{\varvec{x}}}_{i}^{(1)} - \hat{{\varvec{x}}}_{i}^{(j)})^{\mathsf {T}}(\hat{{\varvec{x}}}_{i}^{(1)} -\bar{{\varvec{x}}}_i) + \xi (d_{i}^{(1)})^{2}}{d_{i}^{(j)} - d_{i}^{(1)}} \right\} . \end{aligned}$$
(8)

Note that our definition of pairwise dissimilarities satisfies the requirements of the lemma. The proof uses directional derivatives and the convexity of the objective function. In effect, Lemma 1 states that there are cases in which the nearest-neighbour approximation is optimal for all \(\rho > 0\). This occurs when it is possible to define a hyperplane that contains \(\hat{{\varvec{x}}}_{i}^{(1)}\), the nearest neighbour of \(\bar{{\varvec{x}}}_{i}\), and separates the column vectors in \({\hat{X}}_{{\mathcal {I}}}\) from \(\bar{{\varvec{x}}}_{i}\). In all other cases, there exists a positive value of \(\rho \) such that \({\varvec{\beta }}_i^\star \) has cardinality greater than one.
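The bound in (8) is straightforward to evaluate. The sketch below computes it for a single point, given the scaled neighbours and dissimilarities produced by the preprocessing sketch; skipping ties in the dissimilarities (where the denominator vanishes) is an implementation choice of ours, not something prescribed by the lemma.

```python
import numpy as np

def rho_lower_bound(x_bar, X_hat, d, xi=1e-4):
    """Evaluate the right-hand side of the bound in (8)."""
    order = np.argsort(d)                         # neighbours ordered by dissimilarity
    Xs, ds = X_hat[:, order], d[order]
    x1, d1 = Xs[:, 0], ds[0]                      # nearest neighbour and its dissimilarity
    values = []
    for j in range(1, Xs.shape[1]):
        den = ds[j] - d1
        if den <= 0:                              # skip ties with the nearest neighbour
            continue
        num = (x1 - Xs[:, j]) @ (x1 - x_bar) + xi * d1**2
        values.append(num / den)
    return max(0.0, max(values)) if values else 0.0
```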

We close this section by outlining a simple proximal gradient descent algorithm (Parikh and Boyd 2014) that is faster for large instances of the problem than standard QP solvers. To this end, we first express (7) as an unconstrained minimisation problem through the use of an indicator function,

$$\begin{aligned} \min _{{\varvec{\beta }}_i} f({\varvec{\beta }}_i)+ \mathbb {1}_{\Delta ^{|{\mathcal {I}}|}} ({\varvec{\beta }}_i), \end{aligned}$$

where \(f({\varvec{\beta }}_i)\) is the quadratic function in Eq. (7), and \(\mathbb {1}_{\Delta ^{|{\mathcal {I}}|}}({\varvec{\beta }}_i)\) is zero for \({\varvec{\beta }}_i \in \Delta ^{|{\mathcal {I}}|}\) and infinity otherwise. At each iteration, proximal gradient descent updates \({\varvec{\beta }}_i^t\) by projecting a standard gradient descent step onto the unit simplex,

$$\begin{aligned} {\varvec{\beta }}_i^{t+1} = {{\,\mathrm{arg\,min}\,}}_{{\varvec{\beta }}\in \Delta ^{|{\mathcal {I}}|}} \frac{1}{2}\Vert {\varvec{\beta }}- {\varvec{\beta }}_i^{t} + \eta ^{t} \nabla f({\varvec{\beta }}_i^{t}) \Vert _2^2, \end{aligned}$$

where \(\eta ^{t}\) is the step size at iteration \(t\). Projecting onto \(\Delta ^{|{\mathcal {I}}|}\) can be achieved via a simple algorithm with complexity \({\mathcal {O}}\left( |{\mathcal {I}}| \log \left( |{\mathcal {I}}|\right) \right) \) (Wang and Carreira-Perpinán 2013).
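A compact implementation of this scheme is sketched below. The simplex projection follows the sort-based algorithm of Wang and Carreira-Perpinán (2013); the fixed step size 1/L, with L the largest eigenvalue of the quadratic term, and the iteration cap are our own illustrative choices rather than settings from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex (sort-based, O(n log n))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def wssr_pgd(x_bar, X_hat, d, rho=1e-2, xi=1e-4, n_iter=500):
    """Proximal gradient descent for problem (7)."""
    Q = X_hat.T @ X_hat + xi * np.diag(d**2)
    c = rho * d - X_hat.T @ x_bar
    step = 1.0 / np.linalg.eigvalsh(Q).max()      # 1/L, with L the Lipschitz constant of the gradient
    beta = np.full(len(d), 1.0 / len(d))          # start from the simplex barycentre
    for _ in range(n_iter):
        beta = project_simplex(beta - step * (Q @ beta + c))
    return beta
```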

4 Active learning and constrained clustering

In this section, we describe the process of identifying informative points to query (active learning), and then updating the clustering model to accommodate the most recent labels (constrained clustering). The constrained clustering algorithm described in this section can also be used if the labels for a subset of points were available at the start of the learning process.

We adopt the active learning strategy of Peng and Pavlidis (2019), that queries the points whose labelling information is expected to induce the largest decrease in the reconstruction error function (defined in (2)). Let \({\mathcal {U}}\subseteq \{1,\ldots ,N\}\) denote the set of indices of the unlabelled points, and \({\mathcal {L}}\subseteq \{1,\ldots ,N\}\) denote the set of indices of labelled points. Furthermore, let \(\{c_{i}\}_{i =1}^N\) denote the cluster assignment of each point, and \(\{l_i\}_{i \in {\mathcal {L}}}\) the class labels for the labelled points. To quantify the expected decrease in the reconstruction error after obtaining the label of \({\varvec{x}}_i\) with \(i \in {\mathcal {U}}\), we estimate two quantities. The first is the decrease in the reconstruction error that can be achieved if \({\varvec{x}}_i\) is removed from its currently assigned cluster \(c_{i}\). This is measured by the function \(U_{1}({\varvec{x}}_{i},V_{c_{i}})\), where \(V_{c_i}\) is a matrix containing a basis for cluster \(c_i\). The second is the increase in the reconstruction error due to the addition of \({\varvec{x}}_i\) to a different cluster \(c_{i}^{\prime }\). This is measured by the function \(U_{2}({\varvec{x}}_i,V_{c_i'})\). To estimate \(U_2\) we assume that \(c_i'\) is the cluster that is the second nearest to \({\varvec{x}}_i\) in terms of reconstruction error. This assumption is not guaranteed to hold but it is valid in the vast majority of cases. According to Peng and Pavlidis (2019), the most informative point to query is defined as,

$$\begin{aligned} {\varvec{x}}_i^{\star }= {{\,\mathrm{arg\,max}\,}}_{i \in {\mathcal {U}}} \left\{ U_{1}({\varvec{x}}_{i},V_{c_i})-U_2({\varvec{x}}_{i},V_{c_{i}'}) \right\} . \end{aligned}$$
(9)

The difficulty in calculating \(U_{1}({\varvec{x}}_{i},V_{c_i})\) and \(U_2({\varvec{x}}_{i},V_{c_{i}'})\) is that one needs to account for the fact that a change in the cluster assignment of \({\varvec{x}}_i\) affects both \(V_{c_i}\) and \(V_{c_i'}\). Recall that, irrespective of the choice of the clustering algorithm, the basis \(V_k\) for each cluster (subspace) k is computed by performing Principal Component Analysis (PCA) on the set of points assigned to this cluster. The advantage of the approach of Peng and Pavlidis (2019) is that it accounts for this, and provides a computationally efficient approximation of \(U_1\) and \(U_2\). In particular, using perturbation results for PCA (Critchley 1985), a first-order approximation of the changes in \(V_{c_i}\) and \(V_{c_i'}\) is computed at a cost of \({\mathcal {O}}(P)\) (compared to the cost of PCA, which is \({\mathcal {O}}(\min \{N_k P^2, N_k^2 P\})\), where \(N_k\) is the number of points in cluster k).
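A much simplified version of this selection rule is sketched below: it scores each unlabelled point by the gap between its reconstruction error under the assigned subspace and under the second-nearest subspace, and queries the maximiser of (9). It deliberately treats the bases as fixed, which is precisely the effect that the PCA perturbation results of Peng and Pavlidis (2019) approximate; it is therefore only a crude proxy for the actual criterion, and the interface is an assumption for illustration.

```python
import numpy as np

def recon_error(x, V):
    """Squared reconstruction error of x onto the subspace spanned by the columns of V."""
    r = x - V @ (V.T @ x)
    return float(r @ r)

def select_query(X, labels, unlabelled, bases):
    """Return the index in `unlabelled` maximising the (simplified) criterion in (9)."""
    best_i, best_score = None, -np.inf
    for i in unlabelled:
        errs = np.array([recon_error(X[:, i], V) for V in bases])
        c_i = labels[i]                              # currently assigned cluster
        u1 = errs[c_i]                               # error removed if x_i leaves c_i
        u2 = np.min(np.delete(errs, c_i))            # error added to the second-nearest cluster
        if u1 - u2 > best_score:
            best_i, best_score = i, u1 - u2
        # NB: the bases are treated as fixed here; the paper instead approximates their
        # change via first-order PCA perturbation results at O(P) cost per candidate point.
    return best_i
```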

Once the labels of the queried points are obtained we proceed to the constrained clustering stage in which we update the cluster assignment to accommodate the new information. The first step in our approach modifies pairwise dissimilarities between points in a manner similar to the work of Li et al. (2017). Specifically, for each \({\varvec{x}}_i \in {\mathcal {X}}\) we update all pairwise dissimilarities \(d_{ij} \in {\varvec{d}}_{\mathcal {I}}\) according to,

$$\begin{aligned} d_{ij} = \left\{ \begin{array}{ll} \frac{\Vert {\varvec{x}}_{i}\Vert _2 \Vert {\varvec{x}}_{j}\Vert _2}{|{\varvec{x}}_i^{{\mathsf {T}}} {\varvec{x}}_{j}|} e^{1 - 2 \cdot \mathbb {1}(l_i=l_j)} + \alpha \mathbb {1}(l_i \ne l_j), &{} \;\; \text {if} \;\; i,j \in {\mathcal {L}}, \\ \frac{\Vert {\varvec{x}}_{i}\Vert _2 \Vert {\varvec{x}}_{j}\Vert _2}{|{\varvec{x}}_i^{{\mathsf {T}}} {\varvec{x}}_{j}|} + \alpha \mathbb {1}(c_i \ne c_j), &{} \;\; \text {otherwise}. \end{array} \right. \end{aligned}$$
(10)

The first fraction is the dissimilarity measure in the absence of any label information, as defined in (6). If the labels of both \({\varvec{x}}_i\) and \({\varvec{x}}_j\) are known and differ, then the dissimilarity is first scaled by e and a constant \(\alpha \in [0,1]\) is added. If \(l_i = l_j\), then the original dissimilarity is scaled by \(e^{-1}\). If the label of either \({\varvec{x}}_i\) or \({\varvec{x}}_j\) is unknown, then no scaling is applied, but if the two points were assigned to different clusters in the previous step, then their dissimilarity is increased by \(\alpha \). The term \(\alpha \) quantifies the confidence of the algorithm in the previous cluster assignment. A simple and effective heuristic is to set \(\alpha \) equal to the proportion of labelled data.
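The update in (10) can be applied directly to the stored dissimilarities. In the sketch below, `labels` holds class labels for queried points (and is ignored elsewhere), `clusters` holds the previous cluster assignment, and the base dissimilarity is the inverse absolute cosine similarity of (6); the data layout and names are illustrative assumptions.

```python
import numpy as np

def updated_dissimilarity(x_i, x_j, i, j, labels, clusters, labelled, alpha):
    """Pairwise dissimilarity after incorporating label information, as in (10)."""
    base = np.linalg.norm(x_i) * np.linalg.norm(x_j) / abs(x_i @ x_j)   # Eq. (6)
    if i in labelled and j in labelled:
        same = labels[i] == labels[j]
        return base * np.exp(1 - 2 * same) + alpha * (not same)   # scale by e^{-1} or e; add alpha if labels differ
    return base + alpha * (clusters[i] != clusters[j])             # penalise pairs split by the previous clustering
```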

After updating pairwise dissimilarities through (10) we update the coefficient vectors for each point by solving the problem in (7). The resulting affinity matrix is the input to the normalised spectral clustering of Ng et al. (2002). This cluster assignment is not guaranteed to satisfy all the constraints. To ensure constraint satisfaction, we use this cluster assignment as the initialisation for the K-Subspace Clustering with Constraints (KSCC) algorithm (Peng and Pavlidis 2019). KSCC is an iterative algorithm to solve the following optimisation problem,

$$\begin{aligned} \min _{V_1,\ldots , V_K} \left\{ \sum _{i \in {\mathcal {U}}} \min _{k=1,\ldots ,K} \Vert {\varvec{x}}_i - V_k V_k^{\mathsf {T}}{\varvec{x}}_i\Vert ^2_2 \!+\! \min _{ P \in {\mathcal {P}}(K) } \; \sum _{k=1}^K \; \sum _{ \begin{array}{c} j \in {\mathcal {L}}: \\ l_j = k \end{array}} \Vert {\varvec{x}}_j \!-\! V_{P_{k}} V_{P_{k}}^{\mathsf {T}}{\varvec{x}}_j\Vert _2^2 \right\} ,\nonumber \\ \end{aligned}$$
(11)

where \({\mathcal {P}}(K)\) is the set of all permutations of the cluster indices \(\{1,\ldots ,K\}\); while each permutation \(P \in {\mathcal {P}}(K)\) is a vector \(P = (P_1,\ldots ,P_K)\) whose k-th element, \(P_{k}\), indicates the cluster label associated with class k. The first term in (11) is the reconstruction error for the unlabelled points. The second term quantifies the reconstruction error for the labelled / queried points. To estimate this, we first need to match each cluster label to a unique class label. Each permutation \(P \in {\mathcal {P}}(K)\) of the set \(\{1,\ldots ,K\}\) represents a unique matching of cluster to class labels. According to (11), for every class \(k=1,\ldots ,K\), the reconstruction error of the labelled points of this class, \(\{x_j \,:\, j \in {\mathcal {L}}, \; l_j = k\}\), is computed by projecting them onto the linear subspace (cluster) defined by the basis \(V_{P_k}\). By minimising over all possible permutations \(P \in {\mathcal {P}}(K)\), we identify the matching of cluster to class labels that achieves the smallest overall reconstruction error for the labelled points, subject to the constraint that labelled points from each class are assigned to a unique cluster. This is an instance of the minimum weight perfect matching problem, and can be solved by the Hungarian algorithm (Kuhn 1955) in \({\mathcal {O}}(K^3)\). Note that by design the minimisation problem is always feasible irrespective of the composition of the set \({\mathcal {L}}\).
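The matching subproblem inside (11) is easy to set up explicitly. The sketch below builds the K×K cost matrix whose (k, m) entry is the reconstruction error of the labelled points of class k under the basis of cluster m, and solves the assignment with SciPy's `linear_sum_assignment` (a Hungarian-type solver); the function names and data layout are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_classes_to_clusters(X, class_labels, labelled_idx, bases):
    """Match each class label to a unique cluster by minimising reconstruction error.
    X is P x N, class_labels holds l_j for j in labelled_idx, bases is a list of K bases."""
    K = len(bases)
    cost = np.zeros((K, K))
    for k in range(K):
        pts = [j for j in labelled_idx if class_labels[j] == k]
        for m, V in enumerate(bases):
            for j in pts:
                r = X[:, j] - V @ (V.T @ X[:, j])
                cost[k, m] += r @ r                   # reconstruction error of class k under cluster m
    rows, cols = linear_sum_assignment(cost)          # minimum weight perfect matching
    return {k: m for k, m in zip(rows, cols)}         # class k -> cluster P_k
```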

KSCC is an iterative algorithm that monotonically reduces the value of the objective function. It therefore converges to the local minimum of (11) whose region of attraction contains the initial cluster assignment. The computational complexity of KSCC is the same as that of KSC, namely \({\mathcal {O}}(\min \{N_k P^2, N_k^2 P\})\). We summarise the active learning and constrained clustering framework in procedural form in Algorithm 1. We refer to this extended version of WSSR as WSSR+.

Algorithm 1 The WSSR+ active learning and constrained clustering framework

5 Experiments on synthetic data

In this section, we conduct experiments on synthetic data to evaluate the performance of WSSR under various settings. We compare to the following state-of-the-art spectral-based subspace clustering methods: SSC (Elhamifar and Vidal 2013), S3C (Li et al. 2017), ASSC (Li et al. 2018a), SSC-OMP (You et al. 2016), LSR (Lu et al. 2012), and SMR (Hu et al. 2014). In the set of competing algorithms, we also include SSR (Huang et al. 2013) to allow us to assess the extent to which WSSR improves performance over this algorithm. The purpose of using synthetic data is to evaluate the impact on clustering accuracy of (a) the angles between the linear subspaces (clusters); (b) the noise level; and (c) the subspace dimensionality.

5.1 Parameter settings

We first report the parameter settings for all the algorithms considered. These settings are used in all the experiments reported in this paper. In WSSR, when estimating the coefficient vector for each point through the optimisation problem in (7), we consider only the 10 nearest neighbours of that point (rather than all \((N-1)\) other points in the dataset). The \(L_{1}\) and \(L_{2}\) penalty parameters are set to \(\rho =10^{-2}\) and \(\xi =10^{-4}\), respectively.

Both SSC and S3C are run under the linear subspace setting with no outliers. For SSC, the penalty parameters \(\rho _{\eta }\) and \(\rho _{z}\) in (4) are computed as \(\rho _{\eta }=\alpha _{\eta }/\mu _{\eta }\) and \(\rho _{z}=\alpha _{z}/\mu _{z}\), where the \(\alpha \)s are user-specified and the \(\mu \)s are data-dependent quantities computed according to (14) in Elhamifar and Vidal (2013). We use the default parameter values of \(\alpha _{\eta }=\alpha _{z}=20\). There is a hard and a soft version of S3C; we use the soft version with all its default parameter settings, as Li et al. (2017) report that it yields better clustering results. ASSC shares the same algorithmic implementation as SSC, except that the affine subspace setting is used. We set the maximum number of non-zero coefficients in SSC-OMP to ten, to be consistent with the corresponding setting in WSSR. The OMP algorithm involves a termination threshold, which is set to \(10^{-6}\).

Unlike the previously discussed methods, LSR and SMR yield dense coefficient vectors. There are two versions of LSR, called LSR1 and LSR2. The first solves the LSR problem without allowing self-representation, that is, all diagonal entries of the coefficient matrix are forced to be zero. In LSR2, this constraint is absent. We use LSR1, as Lu et al. (2012) report that it performs better in practice. Furthermore, by not allowing self-representation, LSR1 is directly comparable with all the other methods. There are also two versions of SMR, called SMR-J1 and SMR-J2, which differ in how the affinity matrix, A, is constructed from the coefficient matrix, B. SMR-J1 defines \(A = \left( |B|+|B|^{\mathsf {T}}\right) /2\), as in Sect. 2; we therefore use SMR-J1, since this is how all other methods define the affinity matrix. Finally, we set the SMR-J1 nearest neighbour parameter to ten, to be consistent with the corresponding settings in WSSR and SSC-OMP, while all other parameters are kept at their default values (Hu et al. 2014).

5.2 Varying angles between subspaces

In this set of experiments, we generate data from two 5-dimensional subspaces embedded in a 50-dimensional ambient space. Each cluster contains 200 data points drawn from one of the subspaces, to which Gaussian noise with standard deviation \(\sigma =0.001\) is added. We vary the angle \(\theta \) between the two subspaces from 10 to 60 degrees; the smaller the angle, the more difficult the clustering problem. We evaluate performance under each setting using clustering accuracy, and report the results in Table 1. The best performance results are highlighted in bold, and the second best performance results are underlined.
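One standard way to generate such data, sketched below, constructs two d-dimensional bases whose principal angles are all equal to θ and draws Gaussian coefficients within each subspace before adding ambient noise. The paper does not spell out its exact generation mechanism, so this construction and the parameter names are assumptions for illustration only.

```python
import numpy as np

def two_subspaces(P=50, d=5, n_per_cluster=200, theta_deg=30.0, sigma=1e-3, seed=0):
    """Generate points from two d-dimensional subspaces of R^P with principal angles theta."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis of a random 2d-dimensional subspace, split into V1 and W with V1 orthogonal to W.
    Q, _ = np.linalg.qr(rng.standard_normal((P, 2 * d)))
    V1, W = Q[:, :d], Q[:, d:]
    theta = np.deg2rad(theta_deg)
    V2 = V1 * np.cos(theta) + W * np.sin(theta)   # all principal angles between span(V1) and span(V2) equal theta
    X = np.hstack([V @ rng.standard_normal((d, n_per_cluster)) for V in (V1, V2)])
    X += sigma * rng.standard_normal(X.shape)     # ambient Gaussian noise
    labels = np.repeat([0, 1], n_per_cluster)
    return X, labels
```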

A general trend across almost all methods is that accuracy increases as the angle between the two subspaces increases. WSSR achieves the best performance in all scenarios. Even when the angle between the two subspaces is only 10 degrees, it achieves an accuracy close to 0.95, while for angles greater than or equal to 30 degrees its clustering accuracy is perfect. This can be attributed to the small standard deviation of the noise term; as we discuss in the next subsection, performance deteriorates when \(\sigma \) increases. However, the accuracy of most other methods, including SSC and several of its variants, is much lower, despite the very low noise level. S3C is the strongest competitor and achieves the second best performance in effectively all scenarios.

Table 1 Accuracy of various subspace clustering algorithms on synthetic data with varying angles between subspaces

5.3 Varying noise levels

Next, we explore the effect of different noise levels on clustering performance. Data are generated from five linear subspaces (clusters). Each cluster contains 200 data points that lie in a 5-dimensional subspace within an ambient space of dimension 50. Gaussian noise, \(\varepsilon \sim N(0, \sigma ^2 I_{P})\), is added to the full-dimensional data. Table 2 presents the clustering accuracy of the different algorithms for \(\sigma \in [0,0.5]\).

It is worth noting that all methods achieve perfect clustering accuracy in the noise-free scenario. The accuracy scores of all methods decrease as the noise variance increases, although the speed and magnitude of the performance degradation differ markedly. WSSR achieves the best performance among all competing methods for all values of \(\sigma \), and its performance degrades much more gradually as \(\sigma \) increases. It is also the only method that maintains an accuracy of over 0.9 for \(\sigma \) up to 0.4. Notably, the performance of the SSC-based methods (SSC, S3C, ASSC, SSC-OMP) degrades rapidly with increasing levels of noise; indeed, all of these methods have accuracy scores substantially below 0.5 for \(\sigma =0.5\). In comparison, WSSR, LSR, and SMR, which use the Frobenius norm on the error matrix or the \(L_2\)-norm on the reconstruction error, exhibit better performance.

Table 2 Accuracy of various subspace clustering algorithms on synthetic data with varying noise levels. The first column is not highlighted as all methods have perfect cluster performance in the noise-free scenario

5.4 Varying subspace dimensions

In this subsection, we investigate the impact of the subspace dimensionality, \(d\), on performance, for a fixed ambient space dimensionality, P. In this set of experiments, we fix the ambient space dimension to \(P=100\), and allow \(d\) to increase from 10 to 45. All experiments involve data from five \(d\)-dimensional linear subspaces. As before, Gaussian noise with \(\sigma =0.5\) is added to the data.

Table 3 Accuracy of various subspace clustering algorithms on synthetic data with varying subspace dimensions

Table 3 reports the performance of the considered methods. For the lowest values of d, all methods perform relatively poorly. As d increases, performance initially improves, but accuracy deteriorates as d approaches its maximum value. This happens because the combination of a relatively high noise variance and an increasing \(d\) causes greater overlap among the clusters. WSSR achieves very good relative performance, especially for the lower values of d. For \(d\geqslant 30\), LSR1 is the best performing method. Even for the higher values of \(d\), however, WSSR achieves the best accuracy among the methods that estimate a sparse coefficient vector (with one exception at \(d=40\), where WSSR achieves an accuracy of 0.925, while S3C has an accuracy of 0.926). WSSR is affected by the higher cluster overlap because it becomes increasingly likely that some of the 10-NNs of each point belong to different clusters. This can be overcome by increasing the number of NNs, but for consistency we use the same value, \(k=10\), throughout all the experiments in this paper.

6 Experiments on real data

In this section, we assess the comparative performance of WSSR on a range of real-world datasets, including the MNIST database (LeCun et al. 1998), three cancer gene expression datasets, and the Hopkins155 motion segmentation database.

6.1 MNIST data

The MNIST handwritten digits database contains greyscale images of handwritten digits from 0 to 9 (LeCun et al. 1998), and has been widely used in the machine learning literature to benchmark the performance of supervised and unsupervised learning methods. In the context of subspace clustering, You et al. (2016) used this dataset to demonstrate the effectiveness of SSC-OMP.

The original MNIST database contains 60,000 points in \(P=3472\) dimensions. The data is organised in ten clusters, with each cluster corresponding to a different digit. We adopt the experimental design proposed by You et al. (2016) and conduct two sets of experiments. The first aims to investigate the effect of the number of clusters, \(K\), on performance. To this end, we randomly select \(K\in \left\{ 2, 3, 5, 8, 10 \right\} \) clusters out of the ten digits, and from each chosen cluster (digit) we sample 100 points uniformly at random. The data is then projected onto the first 200 principal components. The rationale for projecting onto 200 dimensions is that when \(K=2\) we obtain a sample of size 200, and the maximum dimensionality of the subspace spanned by these vectors is 200. The second set of experiments proposed by You et al. (2016) is designed to investigate the effect of the number of data points per cluster, \(N_k\), on performance. In these experiments \(K=10\), while \(N_{k}\in \left\{ 50, 100, 500, 1000, 2000\right\} \) points are selected uniformly at random from each cluster (digit). As before, the data is first projected onto the first 200 principal components, and then clustering is applied. For every choice of K (first setup) and \(N_k\) (second setup) we sample 20 datasets and apply the considered clustering algorithms on each of them.

Table 4 Median and standard deviation of clustering accuracy on the MNIST handwritten digits dataset over 20 replications with varying number of (randomly selected) clusters, K

Table 4 reports the median and standard deviation of clustering accuracy for the first set of experiments in which K varies. WSSR always achieves the highest median performance and lowest standard deviation, while the improvement it achieves over competing methods increases as the clustering problem becomes more difficult (K increases). Note that WSSR achieves a substantial improvement over SSR, which is the worst performing method for \(K \geqslant 5\). Most of the competing algorithms achieve excellent accuracy for the smaller values of K, but their performance deteriorates considerably as K increases. SSC and S3C exhibit very similar performance, with SSC achieving slightly higher median accuracy. SSC-OMP appears to have a clear advantage over the other SSC-based methods for larger K. ASSC performs similarly to SSC-OMP in this dataset. LSR compares unfavourably with most other methods, both in terms of median performance, and in terms of performance variability especially for small K.

Table 5 reports the results for the second set of experiments, in which \(K\) is always ten while \(N_k\), the number of points per cluster, varies. Note that for the largest \(N_k\) values the experiments with SSC, S3C, and SMR did not complete all 20 replications within 24 hours; these cases are indicated with dashes in Table 5. For all considered methods, median accuracy improves as \(N_k\) increases, while performance variability decreases (with the exception of SSC-OMP). WSSR achieves the highest median accuracy across all settings, and exhibits the least performance variability. SSC-OMP is the algorithm with the second highest median performance when \(N_k=50,100\), while for larger \(N_k\) values SMR is second best.

Table 5 Median clustering accuracy along with the standard deviations on the MNIST handwritten digits data across 20 replications with varying number of points per cluster

All our experiments on the MNIST database were performed on a cloud computing machine with four CPU cores and 8 GB of RAM. All the algorithms were implemented in MATLAB. Figure 3 illustrates computational times in log-scale for the considered algorithms for each of the two sets of experiments on the MNIST dataset. Both panels indicate that S3C is the most computationally intensive method. SSC and ASSC have similar computational times because they are based on the same optimisation framework. The most computationally efficient methods are SSC-OMP and LSR. SSC-OMP is the SSC variant that is most suitable for large-scale problems, while LSR admits a closed-form solution. For smaller problem sizes, the computational time for WSSR is comparable to that of SSC and ASSC, but the comparison becomes more favourable to WSSR for larger values of \(N_k\) and K.

Fig. 3 Median running times (in log-scale of seconds) of different algorithms on the MNIST handwritten digits data

6.2 Gene expression datasets

Gene expression data have been shown to exhibit a grouping structure, in which each subtype of gene expression forms a different linear subspace (McWilliams and Montana 2014). The following three datasets have been previously adopted to demonstrate the effectiveness of subspace clustering: St. Jude Leukemia, Lung Cancer, and Novartis BPLC (Li et al. 2018b). A summary of their basic characteristics is provided in Table 6.

Table 6 Summary information on the gene expression datasets

Table 7 reports the performance of all clustering algorithms on these datasets. WSSR achieves the best performance on the St. Jude Leukemia and Novartis BPLC datasets, while on the Lung Cancer dataset it achieves an accuracy in excess of 0.9. On the Lung Cancer dataset its accuracy is 5.7% lower than that of the best performing algorithm (S3C) and 0.5% lower than that of the second best (ASSC). On these gene expression datasets the SSC-based methods performed well, while the performance of LSR and SMR was considerably lower.

Table 7 Clustering accuracy of subspace clustering algorithms on three cancer gene expression datasets

6.3 Hopkins155 motion segmentation data

A fundamental problem in computer vision is to infer structures and movements of three-dimensional objects from a video sequence. Video sequences often contain multiple objects moving independently in a scene, captured from a potentially moving camera. Thus, an important initial step in the analysis of video sequences is the motion segmentation problem: Given a set of feature points that are tracked through a sequence of video frames, cluster the trajectories of those points according to different motions (Rao et al. 2010). Using the affine camera model, this problem can be cast as the problem of segmenting samples drawn from multiple linear subspaces.

The Hopkins155 motion segmentation database (Tron and Vidal 2007), which contains 155 video sequences, has been widely used in computer vision to benchmark subspace clustering algorithms. Figure 4 presents boxplots of accuracy across the 155 datasets, for all the considered algorithms. LSR achieves the highest median accuracy and overall the most stable performance. ASSC also exhibits a very high median accuracy and stable performance. WSSR achieves the second highest median accuracy (higher than ASSC) but its performance is more variable compared to LSR and ASSC. Moreover, it performs relatively poorly on specific datasets. With the exception of ASSC other SSC variants achieve a median performance considerably lower than WSSR, and exhibit similar variability. The same is true of SMR and SSR.

Fig. 4 Performance of subspace clustering methods on the Hopkins155 database

7 Constrained clustering with active learning

In this section, we assess the performance of the constrained clustering with active learning framework, WSSR+, introduced in Sect. 4. In Sect. 6.3, we saw that WSSR achieved low accuracy on specific video sequences of the Hopkins155 motion segmentation database. We therefore select the sequences on which WSSR did not produce perfect performance to assess the benefits of incorporating label information through WSSR+.

We compare WSSR+ to three state-of-the-art constrained clustering methods: PLCC (Liu et al. 2018), CSP (Wang et al. 2014), and LCVQE (Pelleg and Baras 2007). PLCC has one tuning parameter, \(\lambda \), that controls the weight assigned to the constraint information; we use \(\lambda =10^4\), the default value recommended by Liu et al. (2018). CSP and LCVQE do not involve any user-specified parameters. Note that all three algorithms describe how to incorporate label information into a clustering model, and can thus be used in conjunction with any clustering algorithm. To ensure a fair comparison, we always use WSSR as the underlying clustering algorithm and select the queried points through the active learning strategy of Peng and Pavlidis (2019). The latter choice is made because the Hopkins155 datasets are known to exhibit subspace clustering structure.

Figure 5 presents boxplots of the clustering accuracy with respect to the proportion of labelled data, which ranges between zero and 0.5. When this proportion is zero, there are no constraints due to labelled data, and hence the corresponding four boxplots illustrate the clustering accuracy of WSSR on the selected datasets. (This is the reason that these four boxplots are identical.) We include the case of no labels because a natural benchmark for any constrained clustering algorithm is its performance relative to the fully unsupervised problem. As the first four boxplots in Fig. 5 show, WSSR achieves a median accuracy close to 0.85, but on some datasets accuracy is lower than 0.5.

Fig. 5 Clustering accuracy of constrained clustering methods on the selected Hopkins155 datasets with respect to varying proportions of labelled data. Queried points are selected through the active learning strategy of Peng and Pavlidis (2019)

Figure 5 shows that on these datasets incorporating label information through CSP and PLCC initially degrades performance. In fact, the performance of CSP and PLCC does not match that of the fully unsupervised case until the proportion of labelled data reaches 0.4 and 0.5, respectively. WSSR+ and LCVQE improve performance, both in terms of the median and the interquartile range, as the proportion of labelled data increases. When the proportion of labelled data is greater than or equal to 0.3, WSSR+ clearly outperforms LCVQE, especially in terms of performance variability. Note that the high variability in performance across all methods when the proportion of labelled data is small is due to the diverse range of datasets considered, rather than to an inherent performance variability of the constrained clustering algorithms.

8 Conclusions and future work

In this work, we proposed a subspace clustering method called Weighted Sparse Simplex Representation (WSSR), which estimates an affinity matrix by approximating each data point as a sparse convex combination of nearby points. We derived a lemma that provides a lower bound which can guide the selection of the critical penalty parameter that controls the degree of sparsity. Extensive experimental results show that WSSR is competitive with state-of-the-art subspace clustering methods. We extended WSSR to the problem of constrained clustering, to accommodate cases where external information about the actual assignment of some points is available. Our constrained clustering approach combines the strengths of spectral-based methods and constrained K-subspace clustering to ensure that the resulting clustering is both accurate and consistent with the label information. We also discussed an appropriate active learning strategy that aims to query the points whose label information can maximally improve the quality of the overall subspace clustering model. Experiments on motion segmentation datasets, in which the unsupervised WSSR algorithm does not perform well, document the effectiveness of the proposed approach in incorporating label information.

In this work we focused on subspace clustering. For this problem, the inverse cosine similarity is a natural proximity measure after projecting the data onto the unit sphere. It would be interesting to explore other affinity measures that can potentially be used to capture data from manifolds. The proposed approach was developed under the assumption that the label information is always correct. This assumption can be violated in certain applications. A challenging future research direction is to design methods that can accommodate uncertainty in the validity of the observed labels, especially when the proportion of labelled data is small.