Information Sciences

Volume 435, April 2018, Pages 161-183

Lossy compression approach to subspace clustering

https://doi.org/10.1016/j.ins.2017.12.056

Abstract

We present a novel subspace clustering algorithm SuMC (Subspace Memory Clustering) based on information theory, the Minimum Description Length Principle (MDLP) and lossy compression. SuMC simultaneously solves two fundamental problems of subspace clustering: determination of the number of clusters and their optimal dimensions.

SuMC requires only two parameters: the data compression ratio r and the number of bits used to code a single scalar; moreover, the optimal value of the compression ratio can be estimated by the Bayesian information criterion (BIC).

We verified that in typical tasks of clustering, image segmentation and data compression, SuMC obtains results that are better than or comparable to those of the leading subspace clustering methods.

Introduction

Clustering techniques have been extensively studied for years in areas such as statistics [6], pattern recognition [10], big data [36], [37] and machine learning. However, most clustering algorithms do not work efficiently in higher dimensional spaces because of the inherent sparsity of data [8]. Problems arise when the distance between any two data points becomes almost the same (this is one of the reasons for the so-called curse of dimensionality [22]); therefore, it is difficult to differentiate similar data points from dissimilar ones. Moreover, clusters are embedded in the subspaces of high-dimensional data space and they often exist in subspaces of different dimensions. Therefore, in recent years, new branches of clustering have been developed: subspace clustering [17], [20], [33] and projected clustering [2], [24], which generalize classical clustering methods for high-dimensional data.

The main problem of standard subspace clustering methods is the determination of the number of clusters and their optimal dimensions. Most algorithms (like ORCLUS) need a fixed number of groups and the same dimension for all clusters, while some others can determine the number of clusters (e.g., 4C), but at the cost of additional parameters (in the case of 4C, four of them).

In this paper we present a subspace clustering method, SuMC, which does not have the above limitations. Our idea comes from the observation that, in the case of coding, it is often profitable to use various compression algorithms specialized for various data types. In such a case, the code for each point consists of two parts: the index identifying the compression algorithm used and the code produced by that algorithm. We combine this approach with ideas from [31], where, using ideas similar to rate distortion in information theory, a subspace clustering approach based on lossy compression was presented. We aim at minimizing the squared error while keeping the total amount of memory fixed. Thanks to the use of a constrained optimization procedure, we are able to establish the optimal number of clusters and the dimensions of all groups simultaneously. In practice, our method tends to remove unnecessary clusters and determine the “right” dimensions of these clusters (subspaces), see Fig. 1. Moreover, since SuMC is based on information theory, it can be efficiently used not only in clustering, but also in image segmentation or data compression, see Section 7.
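To make this concrete, the constrained optimization can be written schematically as follows (an illustrative formulation consistent with the description above; the exact per-cluster memory term used by SuMC is derived in Sections 3 and 4):

\[
\min_{X_1,\ldots,X_k,\; d_1,\ldots,d_k} \; \sum_{i=1}^{k} \sum_{x \in X_i} \operatorname{dist}^2\bigl(x, V_i\bigr)
\quad \text{subject to} \quad \sum_{i=1}^{k} m(X_i, d_i) \le M,
\]

where $V_i$ is a $d_i$-dimensional affine subspace fitted to cluster $X_i$, $m(X_i, d_i)$ is the memory (in bits) needed to encode $X_i$ at dimension $d_i$, and the total budget $M$ is fixed by the compression ratio $r$.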

SuMC needs only two parameters: the compression level r and the number of bits used to code a single scalar. However, if one is only interested in the clustering itself, a reasonable value of r can be determined by using a version of the Bayesian information criterion (BIC), see Section 4.
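To illustrate how such a selection of r could be carried out in practice, the sketch below scans candidate values of r and scores each clustering with a generic Gaussian-error BIC surrogate. It is only a sketch: sumc_cluster is a hypothetical stand-in for the SuMC routine (assumed to return the residual sum of squares and the number of free parameters), and the exact criterion used by the method is given in Section 4.

    import math

    def choose_r_by_bic(X, candidate_rs, bits_per_scalar=8):
        # X: list of points (each a list of coordinates).
        # sumc_cluster is a hypothetical stand-in for the SuMC routine;
        # it is assumed to return (residual sum of squares, number of
        # free model parameters) for a clustering at compression ratio r.
        n, d = len(X), len(X[0])
        best_r, best_bic = None, float("inf")
        for r in candidate_rs:
            rss, n_params = sumc_cluster(X, r, bits_per_scalar)
            # Generic Gaussian-error BIC surrogate, not the paper's formula.
            bic = n * d * math.log(rss / (n * d) + 1e-12) + n_params * math.log(n * d)
            if bic < best_bic:
                best_r, best_bic = r, bic
        return best_r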

Tests conducted on artificial and real data show that SuMC obtains better results (measured by the Rand index) than modern methods like ORCLUS or 4C [7], see Section 6. For example, on randomly generated datasets, SuMC obtains a better Rand index than ORCLUS in about 80% of cases, see Tables 3–4, while in the case of real data, SuMC wins on about 65% of the considered datasets, see Table 5. In terms of computational complexity, SuMC is comparable to the methods regarded as the best in the field, such as ORCLUS [4].

This paper is organized as follows. In Section 2 we review and discuss related work. In Section 3, we present the general theoretical framework, which in Section 4 we adapt to the subspace clustering case. In Section 5 we describe the SuMC algorithm and discuss its complexity. In Section 6, we present empirical results based on synthetic data and well-known datasets such as those from the UCI Repository.

Section snippets

Related works

Traditional clustering methods are usually not applied to high-dimensional data sets due to their poor effectiveness, which is caused by the curse of dimensionality [22]. The problem of high dimensionality is often tackled by applying a dimensionality reduction method. These methods can be divided into feature selection and feature extraction techniques. More information on feature selection methods can be found in [13]. The most popular feature extraction technique is the principal

General optimization problem

In this section we are going to present our approach, which is based on the application of Shannon's entropy [1], [28], Kolmogorov complexity [18], and the minimum description length principle (MDLP) [12]. The first subsection presents an outline of our approach, while the second subsection explains it thoroughly in the context of two clusters.
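For context, we recall the two standard facts this framework builds on (textbook background, not specific to SuMC): the optimal code length of a symbol $x$ with probability $P(x)$ is about $-\log_2 P(x)$ bits, with the expected length bounded below by the Shannon entropy $H(P) = -\sum_x P(x)\log_2 P(x)$, and MDLP scores a model by the two-part code length

\[
L(\text{data}) = L(\text{model}) + L(\text{data} \mid \text{model}),
\]

preferring the model that minimizes the total description length.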

Subspace clustering optimization problem

In this section, we apply the approach presented in the previous sections to the case of subspace clustering. As a result, we obtain a clustering method which can reduce the number of clusters on-line while building clusters of different dimensions.

Let us start from the general subspace clustering problem we consider.

Subspace clustering problem. Let $X \subset \mathbb{R}^N$ be a given data set and let $k \in \mathbb{N}$ denote the upper bound on the number of clusters. Our goal is to divide the data set X into pairwise disjoint
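As an illustration of the quantity such a division is meant to keep small, the sketch below computes the squared reconstruction error of a single cluster projected onto its best-fitting low-dimensional affine subspace (spanned by top principal components). This is a standard construction; whether SuMC realizes the cluster subspaces in exactly this way is specified in Sections 4 and 5.

    import numpy as np

    def cluster_projection_error(points, dim):
        # Squared error of the best rank-`dim` affine approximation of the
        # cluster: center the points and drop all but the top `dim`
        # principal directions; the discarded singular values carry the
        # residual energy.
        mean = points.mean(axis=0)
        centered = points - mean
        s = np.linalg.svd(centered, compute_uv=False)
        return float(np.sum(s[dim:] ** 2))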

Algorithm

In this section we present our algorithm, which can be divided into three phases: initialization phase, iterative phase and dimension refinement phase.

Initialization phase

At the beginning we initialize the clusters by randomly assigning points to clusters. Then, for each group we assign memory proportionally to its size. For a detailed description of this step, suppose that we begin with k clusters $X_1, \ldots, X_k$ such that $X_1 \cup \ldots \cup X_k = X \subset \mathbb{R}^N$, and a parameter r ≥ 1 (the data compression ratio). For each
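A minimal sketch of this phase is given below, assuming the total bit budget is the raw storage cost (N coordinates per point, at a fixed number of bits each) divided by r; the paper's exact budget and per-cluster memory formula are given in Section 4.

    import numpy as np

    def initialize(X, k, r, bits_per_scalar=8, seed=0):
        # X: an (n x N) array of points; k: initial number of clusters;
        # r >= 1: data compression ratio.
        rng = np.random.default_rng(seed)
        n, N = X.shape
        labels = rng.integers(0, k, size=n)            # random assignment
        total_memory = n * N * bits_per_scalar / r     # assumed bit budget
        sizes = np.bincount(labels, minlength=k)
        memory = total_memory * sizes / n              # proportional to size
        return labels, memory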

Experiments

In this section, we evaluate our method, which was implemented in C++. All experiments were run on an Ubuntu 14.04 (64-bit) workstation with a 3.3 GHz quad-core Intel Xeon processor and 32 GB of RAM.

In the literature, there exist many different subspace clustering approaches. We decided to compare our method with ORCLUS [4] and 4C [7], which have the features and properties most similar to SuMC. At the beginning, we compare SuMC with ORCLUS, which detects clusters in arbitrarily oriented subspaces. In

Image segmentation and compression

In this section, we show some possible applications of our algorithm to image segmentation and image compression. The aim of this section is not to present important new results, but rather to indicate a possible research direction regarding the application of SuMC.

Conclusions

In this paper we presented a subspace projection clustering algorithm, SuMC, which is based on information theory and a lossy version of the Minimum Description Length Principle (MDLP). As a consequence, our algorithm has a penalty for using each cluster built into its cost function, which results in its ability to reduce unnecessary clusters. Moreover, thanks to its strong theoretical background, SuMC has a well-defined cost function, and consequently the results of various clusterings can be easily

References (38)

  • P. Arabie et al.

An overview of combinatorial data analysis

    Cluster. Classif.

    (1996)
  • C. Böhm et al.

    Computing clusters of correlation connected objects

Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data

    (2004)
  • D.L. Donoho

    High-dimensional data analysis: the curses and blessings of dimensionality

    AMS Math. Challenges Lecture

    (2000)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise.

KDD

    (1996)
  • K. Fukunaga

Introduction to Statistical Pattern Recognition

    (1990)
  • P.D. Grünwald

    The Minimum Description Length Principle

    (2007)
  • I. Guyon et al.

    An introduction to variable and feature selection

    J. Mach. Learn. Res.

    (2003)
  • J.A. Hartigan

    Clustering Algorithms

    (1975)
  • I. Jolliffe

    Principal Component Analysis

    (2005)