Information Sciences

Volume 435, April 2018, Pages 161-183

Lossy compression approach to subspace clustering

https://doi.org/10.1016/j.ins.2017.12.056

Abstract

We present a novel subspace clustering algorithm SuMC (Subspace Memory Clustering) based on information theory, the Minimum Description Length Principle (MDLP) and lossy compression. SuMC simultaneously solves two fundamental problems of subspace clustering: determination of the number of clusters and their optimal dimensions.

SuMC requires only two parameters: the data compression ratio r and the number of bits used to code a single scalar; moreover, the optimal value of the compression ratio can be estimated by the Bayesian information criterion (BIC).

We verified that in typical tasks of clustering, image segmentation and data compression, SuMC obtains results that are better than or comparable to those of the leading subspace clustering methods.

Introduction

Clustering techniques have been extensively studied for years in areas such as statistics [6], pattern recognition [10], big data [36], [37] and machine learning. However, most clustering algorithms do not work efficiently in higher dimensional spaces because of the inherent sparsity of data [8]. Problems arise when the distance between any two data points becomes almost the same (this is one of the reasons for the so-called curse of dimensionality [22]); therefore, it is difficult to differentiate similar data points from dissimilar ones. Moreover, clusters are embedded in the subspaces of high-dimensional data space and they often exist in subspaces of different dimensions. Therefore, in recent years, new branches of clustering have been developed: subspace clustering [17], [20], [33] and projected clustering [2], [24], which generalize classical clustering methods for high-dimensional data.

The main problem of standard subspace clustering methods is the determination of the number of clusters and their optimal dimensions. Most algorithms (like ORCLUS) need a fixed number of groups and the same dimension for all clusters, while some others can determine the number of clusters (e.g., 4C), but at the cost of additional parameters (in the case of 4C, four of them).

In this paper we present a subspace clustering method, SuMC, which does not have the above limitations. Our idea comes from the observation that, in the case of coding, it is often profitable to use various compression algorithms specialized for various data types. In such a case, the code for each point consists of two parts: the index identifying the compression algorithm used and the code produced by that algorithm. We combine this approach with ideas from [31], where, using ideas similar to rate distortion in information theory, a subspace clustering approach based on lossy compression was presented. We aim at minimizing the squared error while keeping the total amount of memory fixed. Thanks to the use of a constrained optimization procedure, we are able to establish the optimal number of clusters and the dimensions of all groups simultaneously. In practice, our method tends to remove unnecessary clusters and determine the “right” dimensions of these clusters (subspaces), see Fig. 1. Moreover, since SuMC is based on information theory, it can be efficiently used not only in clustering, but also in image segmentation or data compression, see Section 7.
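To make this concrete, the constrained optimization can be written schematically as follows (an illustrative formulation consistent with the description above; the exact per-cluster memory term used by SuMC is derived in Sections 3 and 4):

\[
\min_{X_1,\ldots,X_k,\; d_1,\ldots,d_k} \; \sum_{i=1}^{k} \sum_{x \in X_i} \operatorname{dist}^2\bigl(x, V_i\bigr)
\quad \text{subject to} \quad \sum_{i=1}^{k} m(X_i, d_i) \le M,
\]

where $V_i$ is a $d_i$-dimensional affine subspace fitted to cluster $X_i$, $m(X_i, d_i)$ is the memory (in bits) needed to encode $X_i$ at dimension $d_i$, and the total budget $M$ is fixed by the compression ratio $r$.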

SuMC needs only two parameters: the compression level r and the number of bits used to code a single scalar. However, if one is only interested in the clustering itself, a reasonable value of r can be determined by using a version of the Bayesian information criterion (BIC), see Section 4.
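To illustrate how such a selection of r could be carried out in practice, the sketch below scans candidate values of r and scores each clustering with a generic Gaussian-error BIC surrogate. It is only a sketch: sumc_cluster is a hypothetical stand-in for the SuMC routine (assumed to return the residual sum of squares and the number of free parameters), and the exact criterion used by the method is given in Section 4.

    import math

    def choose_r_by_bic(X, candidate_rs, bits_per_scalar=8):
        # X: list of points (each a list of coordinates).
        # sumc_cluster is a hypothetical stand-in for the SuMC routine;
        # it is assumed to return (residual sum of squares, number of
        # free model parameters) for a clustering at compression ratio r.
        n, d = len(X), len(X[0])
        best_r, best_bic = None, float("inf")
        for r in candidate_rs:
            rss, n_params = sumc_cluster(X, r, bits_per_scalar)
            # Generic Gaussian-error BIC surrogate, not the paper's formula.
            bic = n * d * math.log(rss / (n * d) + 1e-12) + n_params * math.log(n * d)
            if bic < best_bic:
                best_r, best_bic = r, bic
        return best_r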

Tests conducted on artificial and real data show that SuMC obtains better results (measured by the Rand index) than modern methods like ORCLUS or 4C [7], see Section 6. For example, on randomly generated datasets, SuMC obtains a better Rand index than ORCLUS in about 80% of cases, see Tables 3–4, while in the case of real data, SuMC wins on about 65% of the considered datasets, see Table 5. In terms of computational complexity, SuMC is comparable to the methods regarded as the best in the field, such as ORCLUS [4].

This paper is organized as follows. In Section 2 we review and discuss related work. In Section 3, we present the general theoretical framework, which in Section 4 we adapt to the subspace clustering case. In Section 5 we describe the SuMC algorithm and discuss its complexity. In Section 6, we present empirical results based on synthetic data and well-known datasets such as those from the UCI Repository.

Section snippets

Related works

Traditional clustering methods are usually not applied to high-dimensional data sets due to their poor effectiveness, which is caused by the curse of dimensionality [22]. The problem of high dimensionality is often tackled by applying a dimensionality reduction method. These methods can be divided into feature selection and feature extraction techniques. More information on feature selection methods can be found in [13]. The most popular feature extraction technique is the principal

General optimization problem

In this section we are going to present our approach, which is based on the application of Shannon's entropy [1], [28], Kolmogorov complexity [18], and the minimum description length principle (MDLP) [12]. The first subsection presents an outline of our approach, while the second subsection explains it thoroughly in the context of two clusters.
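For context, we recall the two standard facts this framework builds on (textbook background, not specific to SuMC): the optimal code length of a symbol $x$ with probability $P(x)$ is about $-\log_2 P(x)$ bits, with the expected length bounded below by the Shannon entropy $H(P) = -\sum_x P(x)\log_2 P(x)$, and MDLP scores a model by the two-part code length

\[
L(\text{data}) = L(\text{model}) + L(\text{data} \mid \text{model}),
\]

preferring the model that minimizes the total description length.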

Subspace clustering optimization problem

In this section, we apply the approach presented in the previous sections to the case of subspace clustering. As a result, we obtain a clustering method which can reduce the number of clusters on-line while building clusters of different dimensions.

Let us start from the general subspace clustering problem we consider.

Subspace clustering problem. Let $X \subset \mathbb{R}^N$ be a given data set and let $k \in \mathbb{N}$ denote the upper bound on the number of clusters. Our goal is to divide the data set X into pairwise disjoint
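As an illustration of the quantity such a division is meant to keep small, the sketch below computes the squared reconstruction error of a single cluster projected onto its best-fitting low-dimensional affine subspace (spanned by top principal components). This is a standard construction; whether SuMC realizes the cluster subspaces in exactly this way is specified in Sections 4 and 5.

    import numpy as np

    def cluster_projection_error(points, dim):
        # Squared error of the best rank-`dim` affine approximation of the
        # cluster: center the points and drop all but the top `dim`
        # principal directions; the discarded singular values carry the
        # residual energy.
        mean = points.mean(axis=0)
        centered = points - mean
        s = np.linalg.svd(centered, compute_uv=False)
        return float(np.sum(s[dim:] ** 2))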

Algorithm

In this section we present our algorithm, which can be divided into three phases: initialization phase, iterative phase and dimension refinement phase.

Initialization phase

At the beginning we initialize the clusters by randomly assigning points to clusters. Then, for each group we assign memory proportionally to its size. For a detailed description of this step, suppose that we begin with k clusters $X_1, \ldots, X_k$ such that $X_1 \cup \ldots \cup X_k = X \subset \mathbb{R}^N$, and a parameter r ≥ 1 (the data compression ratio). For each
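A minimal sketch of this phase is given below, assuming the total bit budget is the raw storage cost (N coordinates per point, at a fixed number of bits each) divided by r; the paper's exact budget and per-cluster memory formula are given in Section 4.

    import numpy as np

    def initialize(X, k, r, bits_per_scalar=8, seed=0):
        # X: an (n x N) array of points; k: initial number of clusters;
        # r >= 1: data compression ratio.
        rng = np.random.default_rng(seed)
        n, N = X.shape
        labels = rng.integers(0, k, size=n)            # random assignment
        total_memory = n * N * bits_per_scalar / r     # assumed bit budget
        sizes = np.bincount(labels, minlength=k)
        memory = total_memory * sizes / n              # proportional to size
        return labels, memory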

Experiments

In this section, we evaluate our method, which was implemented in C++. All experiments were run on an Ubuntu 14.04 (64-bit) workstation with a 3.3 GHz quad-core Intel Xeon processor and 32 GB of RAM.

In the literature, there exist many different subspace clustering approaches. We decided to compare our method with ORCLUS [4] and 4C [7], which have the features and properties most similar to SuMC. At the beginning, we compare SuMC with ORCLUS, which detects clusters in arbitrarily oriented subspaces. In

Image segmentation and compression

In this section, we show some possible applications of our algorithm to image segmentation and image compression. The aim of this section is not to present important new results, but rather to indicate a possible research direction regarding the application of SuMC.

Conclusions

In this paper we presented a subspace projection clustering algorithm, SuMC, which is based on information theory and a lossy version of the Minimum Description Length Principle (MDLP). As a consequence, our algorithm has a penalty for using each cluster built into its cost function, which results in its ability to reduce unnecessary clusters. Moreover, thanks to its strong theoretical background, SuMC has a well-defined cost function, and consequently the results of various clusterings can be easily

References (38)

  • P. Arabie et al.

An overview of combinatorial data analysis

    Cluster. Classif.

    (1996)
  • C. Böhm et al.

    Computing clusters of correlation connected objects

Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data

    (2004)
  • D.L. Donoho

    High-dimensional data analysis: the curses and blessings of dimensionality

    AMS Math. Challenges Lecture

    (2000)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise.

KDD

    (1996)
  • K. Fukunaga

Introduction to Statistical Pattern Recognition

    (1990)
  • P.D. Grünwald

    The Minimum Description Length Principle

    (2007)
  • I. Guyon et al.

    An introduction to variable and feature selection

    J. Mach. Learn. Res.

    (2003)
  • J.A. Hartigan

    Clustering Algorithms

    (1975)
  • I. Jolliffe

    Principal Component Analysis

    (2005)