Clustering algorithm for mixed datasets using density peaks and Self-Organizing Generative Adversarial Networks

https://doi.org/10.1016/j.chemolab.2020.104070

Highlights

  • Clustering mixed datasets using a density peaks clustering algorithm.

  • Feature map representation by Generative Adversarial Networks.

  • A novel adaptive cost function for the network architecture.

  • The proposed model increases the stability and efficiency of network training.

Abstract

This paper presents a new Density-Peaks and Self-Organizing Generative Adversarial Networks (DP-SO-GAN) model for clustering mixed datasets. Many clustering methods depend on the assumption that datasets contain either categorical or numerical attributes. Nevertheless, most real-world applications involve mixed categorical and numerical attributes; in medicine, for example, clustering cardiovascular disease data is an essential task. Clustering such mixed attributes is a vital and challenging problem. First, we transform the mixed attributes, encoding categorical attributes with a one-hot encoding technique and scaling numerical attributes with normalization techniques. The converted features are fed into a Self-Organizing Generative Adversarial Network (SO-GAN) to learn the feature map. Second, we train two sub-networks, a generator and a discriminator, each of which holds a small number of convolution kernels. Last, we propose an enhanced density peaks clustering algorithm and compute a similarity measure between data objects in the feature representation. The clustering accuracy on the cardiovascular disease dataset reaches 88.32% with a standard deviation of 0.1, which is higher than that of other existing algorithms. The training time for the handwritten digits dataset over 300 epochs is 3148.26 s. Experimental results on five datasets demonstrate the merits of the proposed method, especially in terms of the stability and efficiency of network training. The computational complexity of the proposed method, measured in floating-point operations, is reduced by around 18% compared with the classical generative adversarial network.

Introduction

Clustering aims at determining associations within groups of data and evaluating the similarity among them. Clustering techniques are used in many real-world applications such as medicine, biology, pattern recognition, image classification, document retrieval, and computer vision [1]. Nowadays, clustering analysis has become an increasingly effective method in machine learning. The basic idea of a clustering algorithm is to define a similarity measure between data objects such that objects within a cluster are more similar to each other than to objects in other clusters. In partition-based clustering, cluster centers are selected from the data objects, and data objects are assigned to clusters according to the partition. Algorithms such as K-Prototypes [2], K-Means Clustering for Mixed Datasets (KMCMD) [3], K-Centers [4], Improved K-Prototypes [5], and the K-Harmonic Means type Clustering algorithm for Mixed Datasets (KHMCMD) [6] follow the partition-based clustering approach. Hierarchical clustering builds a tree-like structure that splits or merges data objects into clusters. Algorithms such as Distance Hierarchy (DH) [7] and Similarity-Based Agglomerative Clustering (SBAC) [8] are based on the hierarchical clustering approach. In contrast to the aforementioned algorithms, incremental clustering algorithms process patterns one at a time rather than the entire dataset at once, thereby saving memory. Algorithms such as Modified Adaptive Resonance Theory (M-ART) [9], the Clustering Algorithm based on the methods of Variance and Entropy (CAVE) [10], and the Mixed Self-Organizing Incremental Neural Network (MSOINN) [11] are based on the incremental clustering approach. To be more explicit, fuzzy clustering algorithms describe the relationships between data objects more precisely; algorithms such as Fuzzy K-Prototypes [12] and Kullback-Leibler Fuzzy C-Means Gaussian Mixture Models (KL-FCM-GM) [13] belong to this category. Density peaks clustering algorithms use local density and relative distance to find the relationships among data objects. Algorithms such as Density Peaks clustering for Mixed numerical and categorical Data attributes using Fuzzy Neighborhood (DP-MD-FN) [14], Density Peaks Clustering for Mixed Datasets (DPC-MD) [15], and fast Density Peaks Clustering for mixed Datasets (DPC-M) [16] belong to this category. Besides, deep learning has attracted considerable attention in many fields. Algorithms such as Deep Embedded Clustering (DEC) [17], Discriminatively Boosted Clustering (DBC) [18], and Deep Learning with Nonparametric Clustering (DNC) [19] are based on deep learning techniques.
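
For intuition, the sketch below illustrates the basic density-peaks idea behind this last family of algorithms: every point is assigned a local density rho and the distance delta to its nearest higher-density point, points with large rho*delta are selected as cluster centers, and the remaining points inherit the label of their nearest denser neighbour. This is a minimal, numerical-only illustration; the cutoff distance dc, the center-selection rule, and the function name are our own assumptions and do not reproduce the enhanced mixed-data variant proposed in this paper.

```python
import numpy as np

def density_peaks(X, dc=0.5, n_clusters=3):
    """Toy density-peaks clustering (Rodriguez & Laio style) on numerical data.

    X: (n_samples, n_features) array; dc: density cutoff distance.
    Illustrative only -- not the enhanced mixed-data algorithm of this paper.
    """
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Local density: number of neighbours closer than the cutoff distance dc.
    rho = (dist < dc).sum(axis=1) - 1

    # Delta: distance to the nearest point with higher local density.
    order = np.argsort(-rho)                      # indices by decreasing density
    delta = np.zeros(n)
    nearest_denser = np.zeros(n, dtype=int)
    delta[order[0]] = dist[order[0]].max()        # densest point gets the max distance
    for rank, i in enumerate(order[1:], start=1):
        denser = order[:rank]                     # all points denser than i
        j = denser[np.argmin(dist[i, denser])]
        delta[i], nearest_denser[i] = dist[i, j], j

    # Centers: points with the largest gamma = rho * delta.
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    if labels[order[0]] == -1:                    # ensure the densest point is labelled
        labels[order[0]] = 0
    # Remaining points inherit the label of their nearest denser neighbour,
    # assigned in order of decreasing density.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels

# Example: three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in ([0, 0], [1, 1], [0, 1])])
print(density_peaks(X, dc=0.3, n_clusters=3)[:10])
```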

In general, generative models excel at producing new samples in a high-dimensional subspace [[20], [21], [22], [23], [24]]. In recent times, the Generative Adversarial Network (GAN) [25] has become one of the most standard and fruitful models, capable of producing high-quality realistic images and videos. In contrast to Pixel Recurrent Neural Networks (Pixel-RNN) [26] and the Variational Auto-Encoder (VAE) [27], a GAN can produce high-quality images with lower deviation and lower time complexity. Many adversarial generative networks are asymptotically consistent, whereas the VAE exhibits some bias [28]. At the same time, a GAN has neither a variational lower bound nor a complicated partition function, in contrast to Deep Boltzmann Machines (DBM) [29]. In general, virtual samples can be produced by a GAN in a single forward pass, rather than through an iterative method such as Markov chain operators [30]. The GAN model comprises two sub-networks, a generator and a discriminator, for the reconstruction and classification of images. The generator learns the distribution of the real images fed into the network, so that it can produce realistic images. Meanwhile, the discriminator is used to judge the legitimacy of an input image, i.e., whether it is real or fake. The discriminator and generator are trained as a minimax two-player game [31] to enhance the entire model: the generator is trained to produce realistic images that attempt to fool the discriminator, while the discriminator is trained to distinguish synthetic images produced by the generator from real images. The training process of the GAN model converges when neither the generator nor the discriminator can be optimized further. Such a training process leads to complications and instability in network training, so the capabilities of the generator and discriminator must be well balanced throughout the training of a GAN model.
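
As a concrete illustration of this two-player training scheme, the sketch below shows one alternating discriminator/generator update with the standard non-saturating GAN loss in PyTorch. The fully connected layers, layer sizes, and optimizer settings are placeholder assumptions for illustration; they are not the SO-GAN architecture or the adaptive cost function proposed in this paper.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator (logit output, no sigmoid).
latent_dim, data_dim = 64, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    """One alternating minimax update for the discriminator and generator."""
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                      # do not backprop into G here
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: push D(G(z)) toward 1, i.e. try to fool the discriminator.
    z = torch.randn(batch, latent_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# Usage with a random stand-in batch of "real" data.
print(train_step(torch.rand(32, data_dim) * 2 - 1))
```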

In this paper, we propose a novel Density-Peaks and Self-Organizing Generative Adversarial Networks model, namely DP-SO-GAN, for clustering mixed datasets. To be more explicit, we transform the mixed attributes, encoding categorical attributes with a one-hot encoding technique and scaling numerical attributes with a normalization technique. To improve the feature representation of the datasets, the normalized attribute values are fed into a generative adversarial network. The proposed SO-GAN model uses a co-occurring growth-and-prune scheme, combined with a novel adaptive cost function, for network optimization; this increases the training stability of the GAN model. The growth part of the proposed SO-GAN model is inspired by the evolution of neural cells in the human brain [32].
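
A minimal sketch of this preprocessing step is given below: categorical columns are one-hot encoded and numerical columns are min-max normalized to [0, 1], producing a single real-valued vector per object that can be fed to the network. The column names and the pandas-based implementation are our own illustrative assumptions; the exact normalization scheme used in the paper may differ.

```python
import pandas as pd

def encode_mixed(df, categorical_cols, numerical_cols):
    """One-hot encode categorical columns and min-max normalize numerical
    columns, returning one real-valued feature matrix for the network."""
    one_hot = pd.get_dummies(df[categorical_cols].astype(str), dtype=float)
    num = df[numerical_cols].astype(float)
    num = (num - num.min()) / (num.max() - num.min() + 1e-12)  # min-max scaling to [0, 1]
    return pd.concat([num, one_hot], axis=1).to_numpy()

# Example with hypothetical cardiovascular-style attributes.
df = pd.DataFrame({
    "age": [52, 61, 45],
    "cholesterol": [230, 180, 210],
    "chest_pain_type": ["typical", "atypical", "non-anginal"],
    "smoker": ["yes", "no", "no"],
})
X = encode_mixed(df, ["chest_pain_type", "smoker"], ["age", "cholesterol"])
print(X.shape)  # (3, number of numerical + one-hot columns)
```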

The rest of the paper is organized as follows. We first introduce the related work in Section 2. Then we present the proposed DP-SO-GAN model in Section 3. The experimental results are reported in Section 4 and the conclusion is drawn in Section 5.

Section snippets

Related work

To handle mixed categorical and numerical attributes, many clustering methods have been proposed based on partition-based clustering, hierarchical clustering, attribute-conversion clustering, and density-based clustering.

Partition-based clustering is the most primitive type; it usually groups elements into k clusters depending on the cohesive distance between attributes. Each cluster is initialized with a cluster center determined at the beginning of the clustering process. The

The proposed DP-SO-GAN model

In this section, we design a novel clustering algorithm for mixed data. The proposed DP-SO-GAN clustering model is illustrated in Fig. 1.

To summarize, the main innovations of the proposed method include:

  • Initially, mixed data attributes are transformed: categorical attributes using a one-hot encoding technique and numerical attributes using normalization techniques. The converted features are fed into a SO-GAN to learn the feature map.

  • The growth strategy increases the training steadiness,

Experimental results

To illustrate the efficiency of the proposed DP-SO-GAN model, we conduct a set of experiments on five datasets: the Cardiovascular Disease dataset, the MNIST dataset of handwritten digits, CIFAR-10, the CUHK Face Sketch database (CUFS), and the CelebFaces Attributes dataset (Celeb-A).

Conclusion

In this paper, we proposed a novel DP-SO-GAN model for clustering mixed datasets. We transform mixed data attributes, encoding categorical attributes with a one-hot encoding technique and scaling numerical attributes with normalization techniques, and feed them into a SO-GAN to learn the feature map. The proposed model is based on a mechanism with growth and pruning strategies for network optimization. The major advantage of the proposed method lies in the self-organization mechanism of the network. The

Author contributions

All authors equally contributed.

Declaration of competing interest

There is no conflict of interest.

References (70)

  • F. Li et al.

    Discriminatively boosted image clustering with fully convolutional autoencoders

    Pattern Recogn.

    (2018)
  • D.W. Kim et al.

    Fuzzy clustering of categorical data using fuzzy centroids

    Pattern Recogn. Lett.

    (2004)
  • H. Ralambondrainy

    A conceptual version of the K-means algorithm

    Pattern Recogn. Lett.

    (1995)
  • A.K. Jain et al.

    Data clustering: a review

    ACM Comput. Surv.

    (1999)
  • Z. Huang

    Extensions to the k-means algorithm for clustering large data sets with categorical values

    Data Min. Knowl. Discov.

    (1998)
  • Wei-Dong Zhao et al.

    K-Centers Algorithm for Clustering Mixed Type Data

    PAKDD

    (2007)
  • C. Li et al.

    Unsupervised learning with mixed numeric and nominal data

    IEEE Trans. Knowl. Data Eng.

    (2002)
  • Fakhroddin Noorbehbahani et al.

    An incremental mixed data clustering method using a new distance measure

    Soft Comput.

    (2015)
  • S. Liu et al.

    Clustering mixed data by fast search and find of density peaks

    Math. Probl. Eng.

    (2017)
  • J. Xie et al.

    Unsupervised deep embedding for clustering analysis

  • G. Chen

    Deep Learning with Nonparametric Clustering

    arXiv preprint arXiv:...
  • A. van den Oord et al.

    WaveNet: A Generative Model for Raw Audio

    (2016)
  • J. Dai et al.

    Generative Modeling of Convolutional Neural Networks

    (2014)
  • J.D. Co-Reyes et al.

    Self-consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

    (2018)
  • D. George et al.

    A generative vision model that trains with high data efficiency and breaks text-based captchas

    Science

    (2017)
  • Z.-H. Feng et al.

    A unified tensor-based active appearance model

    ACM Trans. Multimed Comput. Commun. Appl

    (2019)
  • I. Goodfellow et al.

    Generative adversarial nets

  • A. van den Oord et al.

    Pixel Recurrent Neural Networks

    (2016)
  • D.P. Kingma et al.

    Auto-encoding Variational Bayes

    (2013)
  • D.P. Kingma et al.

    Improved variational inference with inverse autoregressive flow

  • R. Salakhutdinov et al.

    Deep Boltzmann Machines

    Artificial Intelligence and Statistics

    (2009)
  • W.R. Gilks et al.

    Markov Chain Monte Carlo in Practice

    (1995)
  • M. Arjovsky et al.

    Wasserstein GAN

    (2017)
  • S.F. Sorrells et al.

    Human hippocampal neurogenesis drops sharply in children to undetectable levels in adults

    Nature

    (2018)
  • Z. Huang

    A fast clustering algorithm to cluster very large categorical data sets in data mining

    Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD’97), Tucson, AZ, USA

    (11 May 1997)