Clustering algorithm for mixed datasets using density peaks and Self-Organizing Generative Adversarial Networks
Introduction
Clustering aims to determine the associations within a group of data and to evaluate the similarity among them. Clustering techniques are used in many real-world applications such as medicine, biology, pattern recognition, image classification, document retrieval, and computer vision [1]. Cluster analysis has become an increasingly effective method in machine learning. The basic idea of a clustering algorithm is to apply a similarity measure between data objects so that objects in the same cluster are more similar to each other than to objects in other clusters. In partition-based clustering, cluster centers are selected based on the centers of the data objects, and data objects are assigned to clusters according to the partition. Algorithms such as K-Prototypes [2], K-Means Clustering for Mixed Datasets (KMCMD) [3], K-Centers [4], Improved K-Prototypes [5], and K-Harmonic Means type Clustering algorithm for Mixed Datasets (KHMCMD) [6] have been proposed using the partition-based clustering technique. The hierarchical clustering approach creates a tree-like structure that splits or merges the data objects into clusters. Algorithms such as Distance Hierarchy (DH) [7] and Similarity-Based Agglomerative Clustering (SBAC) [8] have been proposed based on the hierarchical clustering approach. In contrast to the aforementioned algorithms, incremental clustering algorithms have been developed to process an entire pattern at a time and to save memory space. Algorithms such as Modified Adaptive Resonance Theory (M-ART) [9], Clustering Algorithm based on the methods of Variance and Entropy (CAVE) [10], and Mixed Self-Organizing Incremental Neural Network (MSOINN) [11] have been proposed based on the incremental clustering approach.
Fuzzy clustering algorithms have been developed to describe the relationships between data objects more accurately. Algorithms such as Fuzzy K-Prototypes [12] and Kullback-Leibler Fuzzy C-Means Gaussian Mixture Models (KL-FCM-GM) [13] fall into this category. Density peaks clustering algorithms use the local density of data objects and their relative distance to find the relationships among them. Algorithms such as Density Peaks clustering for Mixed numerical and categorical Data attributes using Fuzzy Neighborhood (DP-MD-FN) [14], Density Peaks Clustering for Mixed Datasets (DPC-MD) [15], and fast Density Peaks Clustering for mixed Datasets (DPC-M) [16] have been proposed in this category. In addition, deep learning has attracted considerable attention in many fields. Algorithms such as Deep Embedded Clustering (DEC) [17], Discriminatively Boosted Clustering (DBC) [18], and Deep Learning with Nonparametric Clustering (DNC) [19] have been proposed based on deep learning techniques.
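As an illustration of the two quantities that density peaks clustering relies on (local density and relative distance), the following is a minimal sketch and not the method of any of the cited algorithms. It assumes purely numerical points, Euclidean distance, and a simple cutoff kernel; the function name `density_peaks` and the parameter `cutoff` are hypothetical names chosen for this example.

```python
import math

def density_peaks(points, cutoff):
    """Return (rho, delta) for every point.

    rho[i]   : local density, the number of other points closer than `cutoff`
    delta[i] : relative distance, the distance to the nearest point with a
               strictly higher density; points with no denser neighbour get
               their largest distance to any point (a simplifying assumption
               that also applies to ties for the highest density)
    """
    n = len(points)
    # Pairwise Euclidean distances (assumption: numerical attributes only).
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        to_denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(to_denser) if to_denser else max(dist[i]))
    return rho, delta

# Two well-separated groups of points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peaks(pts, cutoff=1.5)
print(rho)  # [2, 2, 2, 1, 1]
```

Points that combine a high `rho` with a high `delta` are the usual candidates for cluster centers; how mixed categorical and numerical attributes enter the distance is specific to each cited algorithm and is not modelled here.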
In general, generative models excel at producing a range of new samples in a high-dimensional subspace [20], [21], [22], [23], [24]. Recently, the Generative Adversarial Network (GAN) [25] has become one of the most widely used and successful generative models, capable of producing high-quality, realistic images and videos. Compared with Pixel Recurrent Neural Networks (Pixel-RNN) [26] and the Variational Auto-Encoder (VAE) [27], a GAN can produce high-quality images with lower deviation and at low time complexity. Generative adversarial networks are asymptotically consistent, whereas a VAE has some bias [28]. In addition, unlike Deep Boltzmann Machines (DBM) [29], a GAN requires neither a variational lower bound nor a complicated partition function. A GAN can also produce synthetic samples in a single forward pass rather than through an iterative method such as Markov chain operators [30]. The GAN model consists of two sub-networks, a generator and a discriminator, for the reconstruction and classification of images, respectively. The generator learns the distribution of the real images fed into the network so that it can produce realistic images. Meanwhile, the discriminator judges the authenticity of an input image, i.e., whether it is real or fake. The discriminator and generator are trained against each other in a minimax two-player game [31] to improve the entire model. The generator is trained to produce increasingly realistic images that attempt to fool the discriminator. At the same time, the discriminator is trained to distinguish the synthetic images produced by the generator from real images. Training continues until neither the generator nor the discriminator can be optimized further. Such a training process makes network training complicated and uncertain. The capabilities of the generator and discriminator must therefore be well balanced throughout the training of a GAN model.
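For reference, the minimax two-player game described above is commonly written as the standard GAN objective below. This is the generic formulation, not the adaptive cost function of the proposed model. The symbols are notation introduced here rather than taken from this text: $D(x)$ is the discriminator's estimate of the probability that $x$ is real, $G(z)$ is the generator's output for a noise vector $z$, $p_{\text{data}}$ is the distribution of real data, and $p_{z}$ is the noise prior.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes $V$ by making $D(G(z))$ large.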
In this paper, we propose a novel Density-Peaks and Self-Organizing Generative Adversarial Networks model, namely DP-SO-GAN, for clustering mixed datasets. Specifically, we transform the mixed data attributes: categorical attributes with a one-hot encoding technique and numerical attributes with a normalization technique. To improve the feature representation of the datasets, the normalized attribute values are fed into a generative adversarial network. The proposed SO-GAN model uses a scheme in which growth and pruning co-occur, together with a novel adaptive cost function, to optimize the network. The proposed SO-GAN model increases the training stability of a GAN model. The growth component of the proposed SO-GAN model is inspired by the evolution process of the neural cells of the human brain [32].
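The attribute transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say which normalization is used, so min-max scaling to [0, 1] is assumed, and the function names and the column names `age` and `sex` are hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector (categories sorted for a stable order)."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def min_max(values):
    """Rescale numbers to [0, 1]; a constant column maps to all zeros (assumed convention)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def encode_mixed(rows, categorical_cols, numerical_cols):
    """Turn a list of dict records into flat numeric feature vectors."""
    blocks = []
    for col in numerical_cols:
        # One normalized value per record for each numerical attribute.
        blocks.append([[x] for x in min_max([r[col] for r in rows])])
    for col in categorical_cols:
        # One 0/1 indicator per category for each categorical attribute.
        blocks.append(one_hot([r[col] for r in rows]))
    return [[x for block in blocks for x in block[i]] for i in range(len(rows))]

# Hypothetical records with one numerical and one categorical attribute.
rows = [{"age": 50, "sex": "F"}, {"age": 60, "sex": "M"}, {"age": 70, "sex": "F"}]
print(encode_mixed(rows, categorical_cols=["sex"], numerical_cols=["age"]))
# [[0.0, 1.0, 0.0], [0.5, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

Each record becomes a single numeric vector, which is the form a network such as a GAN can take as input.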
The rest of the paper is organized as follows. We first introduce the related work in Section 2. Then we present the proposed DP-SO-GAN model in Section 3. The experimental results are reported in Section 4 and the conclusion is drawn in Section 5.
Related work
To handle mixed categorical and numerical attributes, many clustering methods have been proposed based on partition-based clustering, hierarchical clustering, attribute-conversion clustering, and density-based clustering.
The partition-based clustering algorithm is the most basic type; it usually groups elements into k clusters depending on the cohesive distance of attributes. Each cluster is initialized with a cluster center that is determined at the beginning of the clustering process. The
The proposed DP-SO-GAN model
In this section, we design a novel clustering algorithm for mixed data. The proposed DP-SO-GAN clustering model is illustrated in Fig. 1.
To summarize, the main innovations of the proposed method include:
- First, the mixed data attributes are transformed: categorical attributes with a one-hot encoding technique and numerical attributes with normalization techniques. The converted attributes are input to an SO-GAN to learn the feature map
- The growth strategy increases the training stability,
Experimental results
To illustrate the efficiency of the proposed DP-SO-GAN model, we conducted a set of experiments on five datasets: the Cardiovascular Disease dataset, the MNIST dataset of handwritten digits, CIFAR-10, CUHK Face Sketch (CUFS), and the CelebFaces Attributes dataset (Celeb-A).
Conclusion
In this paper, we proposed a novel DP-SO-GAN model for clustering mixed datasets. We transform the mixed data attributes, categorical attributes with a one-hot encoding technique and numerical attributes with normalization techniques, and input them to an SO-GAN to learn the feature map. The proposed model is based on a mechanism with growth and pruning strategies for network optimization. The main advantage of the proposed method lies in the self-organization mechanism of the network. The
Author contributions
All authors contributed equally.
Declaration of competing interest
There is no conflict of interest.
References (70)
- et al. A K-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. (2007)
- et al. An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing (2013)
- Sarosh Hashmi. K-Harmonic means type clustering algorithm for mixed datasets. Appl. Soft Comput. (2016)
- et al. Hierarchical clustering of mixed data based on distance hierarchy. Inf. Sci. (2007)
- et al. Incremental Clustering of mixed data based on distance hierarchy. Expert Syst. Appl. (2008)
- et al. Mining of mixed data with application to catalog marketing. Expert Syst. Appl. (2007)
- et al. A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowl. Base Syst. (2012)
- A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert Syst. Appl. (2011)
- et al. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowl. Base Syst. (2017)
- et al. A novel density peaks clustering algorithm for mixed data. Pattern Recogn. Lett. (2017)