Lossy compression approach to subspace clustering
Introduction
Clustering techniques have been studied extensively for years in areas such as statistics [6], pattern recognition [10], big data [36], [37] and machine learning. However, most clustering algorithms do not work efficiently in high-dimensional spaces because of the inherent sparsity of the data [8]. Problems arise when the distance between any two data points becomes almost the same (one of the manifestations of the so-called curse of dimensionality [22]); it then becomes difficult to distinguish similar data points from dissimilar ones. Moreover, clusters are embedded in subspaces of the high-dimensional data space, and they often exist in subspaces of different dimensions. Therefore, in recent years, new branches of clustering have been developed: subspace clustering [17], [20], [33] and projected clustering [2], [24], which generalize classical clustering methods to high-dimensional data.
The main problem of standard subspace clustering methods is determining the number of clusters and their optimal dimensions. Most algorithms (like ORCLUS) require a fixed number of clusters and the same dimension for all of them, while some others (like 4C) can adapt the number of clusters, but at the cost of additional parameters (4C requires four).
In this paper we present a subspace clustering method, SuMC, which does not have the above limitations. Our idea comes from the observation that, in coding, it is often profitable to use various compression algorithms specialized for various data types. In such a case, the code for each point consists of two parts: the index identifying the compression algorithm used and the code produced by that algorithm. We combine this approach with ideas from [31] where, using ideas similar to rate-distortion theory, a subspace clustering approach based on lossy compression was presented. We aim at minimizing the squared error while keeping the total amount of memory fixed. Thanks to the use of a constrained optimization procedure, we are able to establish the optimal number of clusters and the dimensions of all groups simultaneously. In practice, our method tends to remove unnecessary clusters and determine the “right” dimensions of the clusters (subspaces), see Fig. 1. Moreover, since SuMC is based on information theory, it can be efficiently used not only in clustering, but also in image segmentation or data compression, see Section 7.
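The two-part code described above can be made concrete with a simple cost model: each point pays a fixed number of bits to identify its cluster, plus a per-coordinate cost for the subspace it is stored in. The following sketch illustrates this idea only; it is an assumed toy cost model based on the description above, not the exact SuMC cost function.

```python
import math

def code_length(point_counts, cluster_dims, bits_per_scalar):
    """Illustrative two-part code length: each point pays log2(k) bits to
    identify its cluster, plus bits_per_scalar bits for each coordinate in
    its cluster's subspace.  A toy model, not SuMC's actual cost function."""
    k = len(point_counts)
    index_bits = math.log2(k) if k > 1 else 0.0
    total = 0.0
    for n_i, d_i in zip(point_counts, cluster_dims):
        total += n_i * (index_bits + d_i * bits_per_scalar)
    return total

# Two clusters of 100 points, subspace dimensions 2 and 3, 8 bits per scalar:
# 100*(1 + 16) + 100*(1 + 24) = 4200 bits in total.
print(code_length([100, 100], [2, 3], 8))  # 4200.0
```

Under this model, lowering a cluster's dimension frees memory that can be spent elsewhere, which is the trade-off the constrained optimization exploits.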
SuMC needs only two parameters: the compression level r and the number of bits used to code a single scalar. If one is only interested in the clustering itself, a reasonable value of r can be determined using a version of the Bayesian information criterion (BIC), see Section 4.
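To illustrate how a BIC-style criterion can select r, the sketch below scores candidate clusterings with a generic BIC under a spherical-Gaussian error model. The score function and the scan over r are assumptions for illustration; the paper's own BIC variant (Section 4) may differ.

```python
import math

def bic_score(squared_error, n_points, n_params):
    """Generic BIC under a spherical-Gaussian error model: up to constants,
    -2*logL reduces to n*log(MSE), and n_params*log(n) penalizes model
    complexity.  An illustrative criterion, not SuMC's exact variant."""
    mse = squared_error / n_points
    return n_points * math.log(mse) + n_params * math.log(n_points)

# Hypothetical scan: for each candidate compression level r we suppose a
# clustering run returned (total squared error, number of free parameters),
# and we keep the r with the lowest BIC-like score.
results = {1.5: (120.0, 10), 2.0: (150.0, 6), 3.0: (400.0, 4)}
best_r = min(results, key=lambda r: bic_score(results[r][0], 500, results[r][1]))
```

The point of the criterion is that a smaller r (more memory) always lowers the squared error, so an explicit complexity penalty is needed to stop the scan from always preferring the least compression.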
Tests conducted on artificial and real data show that SuMC obtains better results (measured by the Rand index) than modern methods like ORCLUS or 4C [7], see Section 6. For example, on randomly generated datasets, SuMC obtains a better Rand index than ORCLUS in about 80% of cases, see Tables 3–4, while on real data, SuMC won on about 65% of the considered datasets, see Table 5. In terms of computational complexity, SuMC is comparable to the methods regarded as the best in the field, such as ORCLUS [4].
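For reference, the Rand index used in these comparisons is the fraction of point pairs on which two clusterings agree (both place the pair in the same cluster, or both place it in different clusters). A minimal sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two labelings agree,
    i.e. both put the pair together or both put it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions up to relabeling score 1.0:
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because only co-membership of pairs matters, the index is invariant to how cluster labels are numbered, which makes it suitable for comparing clusterings produced by different algorithms.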
This paper is organized as follows. In Section 2 we review and discuss related work. In Section 3 we present the general theoretical framework, which in Section 4 we adapt to the subspace clustering case. In the next section we describe the SuMC algorithm and discuss its complexity. In Section 6 we present empirical results based on synthetic data and well-known datasets, such as those from the UCI Repository.
Section snippets
Related works
Traditional clustering methods are usually not applied to high-dimensional data sets due to their poor effectiveness, which is caused by the curse of dimensionality [22]. The problem of high dimensionality is often tackled by applying a dimensionality reduction method. These methods can be divided into feature selection and feature extraction techniques. More information on feature selection methods can be found in [13]. The most popular feature extraction technique is the principal component analysis (PCA).
General optimization problem
In this section we present our approach, which is based on applications of Shannon's entropy [1], [28], Kolmogorov complexity [18], and the minimum description length principle (MDLP) [12]. The first subsection presents an overview of our approach, while the second explains it thoroughly in the context of two clusters.
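The information-theoretic quantities mentioned above connect directly to code length: Shannon's entropy gives the lower bound on the average number of bits per symbol for any lossless code of a source. A minimal sketch of the empirical entropy:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Empirical Shannon entropy in bits per symbol: the lower bound on
    the average per-symbol length of any lossless code for this source."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A uniform two-symbol source needs 1 bit per symbol:
print(shannon_entropy("aabb"))  # 1.0
```

MDLP extends this view from symbols to models: among candidate models, prefer the one minimizing the total description length of the model plus the data encoded with its help, which is the principle the clustering cost function builds on.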
Subspace clustering optimization problem
In this section, we apply the approach presented in previous sections to the case of subspace clustering. As a result, we obtain a clustering method, which can reduce the number of clusters on-line, while building the clusters of different dimensions.
Let us start from the general subspace clustering problem we consider.
Subspace clustering problem. Let X = {x_1, …, x_n} ⊂ R^N be a given data set and let k denote the upper bound of the number of clusters. Our goal is to divide the data set X into pairwise disjoint clusters.
Algorithm
In this section we present our algorithm, which can be divided into three phases: initialization phase, iterative phase and dimension refinement phase.
Initialization phase
At the beginning we initialize clusters by randomly assigning points to clusters. Then, to each group we assign memory proportionally to its size. For a detailed description of this step, suppose that we begin with k clusters X_1, …, X_k such that X = X_1 ∪ … ∪ X_k, and a parameter r ≥ 1 (the data compression ratio).
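The initialization step described above can be sketched as follows. The total memory budget derived from r and the function and parameter names are illustrative assumptions, not SuMC's actual interface.

```python
import random

def initialize(points, k, r, bits_per_scalar=32):
    """Illustrative initialization following the description above: assign
    points to k clusters at random, then split the total memory budget
    among clusters proportionally to their sizes.  The budget n*D*bits/r
    models a compression ratio r >= 1; names here are hypothetical."""
    n, dim = len(points), len(points[0])
    total_bits = n * dim * bits_per_scalar / r   # budget shrinks as r grows
    assignment = [random.randrange(k) for _ in points]
    sizes = [assignment.count(i) for i in range(k)]
    memory = [total_bits * s / n for s in sizes]  # proportional to size
    return assignment, memory

random.seed(0)
pts = [[0.0, 0.0, 0.0]] * 10
assignment, memory = initialize(pts, k=3, r=2.0)
```

Since the per-cluster budgets sum to the fixed total, later moves of points or memory between clusters are zero-sum, which is what makes the constrained optimization meaningful.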
Experiments
In this section we evaluate our method, implemented in C++. All experiments were run on an Ubuntu 14.04 (64-bit) workstation with a 3.3 GHz quad-core Intel Xeon processor and 32 GB RAM.
In the literature, there exist many different subspace clustering approaches. We decided to compare our method with ORCLUS [4] and 4C [7], which have the features and properties most similar to SuMC. At the beginning, we compare SuMC∞ with ORCLUS, which detects clusters in arbitrarily oriented subspaces.
Image segmentation and compression
In this section we show some possible applications of our algorithm in image segmentation and image compression. The aim of this section is not to present new results, but rather to indicate a possible research direction regarding applications of SuMC.
Conclusions
In this paper we presented a subspace projection clustering algorithm, SuMC, which is based on information theory and a lossy version of the Minimum Description Length Principle (MDLP). As a consequence, our algorithm has a penalty for each cluster used built into its cost function, which results in its ability to remove unnecessary clusters. Moreover, thanks to its strong theoretical background, SuMC has a well-defined cost function, and consequently the results of various clusterings can be easily compared.
References (38)
- et al., Subspace clustering using affinity propagation, Pattern Recognit. (2015)
- et al., Cross-entropy clustering, Pattern Recognit. (2014)
- et al., Distance metric learning for soft subspace clustering in composite kernel space, Pattern Recognit. (2016)
- et al., High-order possibilistic c-means algorithms based on tensor decompositions for big data in IoT, Inf. Fusion (2018)
- et al., Multiobjective evolutionary algorithm-based soft subspace clustering, Evolutionary Computation (CEC), 2012 IEEE Congress on (2012)
- et al., On measures of information and their characterizations (1975)
- et al., Data Clustering: Algorithms and Applications (2013)
- et al., Fast algorithms for projected clustering, ACM SIGMOD Record (1999)
- et al., Finding Generalized Projected Clusters in High Dimensional Spaces (2000)
- et al., Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications (1998)
- An overview of combinatorial data analysis, Cluster. Classif.
- Computing clusters of correlation connected objects, Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data
- High-dimensional data analysis: the curses and blessings of dimensionality, AMS Math. Challenges Lecture
- A density-based algorithm for discovering clusters in large spatial databases with noise, KDD
- Introduction to Statistical Pattern Recognition
- The Minimum Description Length Principle
- An introduction to variable and feature selection, J. Mach. Learn. Res.
- Clustering Algorithms
- Principal Component Analysis