
1 Introduction

Clustering is a fundamental pillar of unsupervised machine learning, and it is widely used in a range of tasks across disciplines. Over the past decades, a variety of clustering algorithms have been developed [5], such as k-means [6], Gaussian Mixture Models (GMMs) [14], HDBSCAN [1], and hierarchical algorithms [15]. However, these algorithms typically require features to be hand-crafted or learned for each dataset and task, and those features must then be analyzed with feature selection in order to eliminate redundant or poor-quality ones. These requirements are even more challenging in the unsupervised setting. Moreover, the process is time-consuming and brittle [17], since the choice of features has a large influence on the performance of the clustering algorithm.

In this paper, we formulate the following hypothesis: if we apply an adequate embedding to the raw data, i.e., an embedding that yields a good distance-preserving manifold, then this should help clustering algorithms do their job. One key question is which embedding technique to apply in order to find the best embedding manifold. Many methods exist, including those that perform a linear transformation of the data, such as the well-known Principal Component Analysis (PCA) [10]. However, PCA is a linear method and does not perform well when the relationships in the data are non-linear. Fortunately, non-linear manifold learning methods exist, and they can be categorized by whether they focus on local or global structure. Isomap [7] is a well-known globally focused method, while t-SNE [8] is considered a locally focused one. A more recent manifold learning technique is UMAP [11], which has shown better ability to preserve both the local and the global structure. In this paper, we investigate the use of this latter technique, because it outperforms its competitors [11] and has proven able to meet our needs [16, 18].

Our main focus is on measuring the improvement achieved by each clustering algorithm thanks to the application of the UMAP embedding manifold. To validate our method, we conduct a number of experiments on five datasets. We empirically observe that this approach allows the clustering algorithms to be competitive with state-of-the-art techniques. The rest of this paper is organized as follows. We present more details about the UMAP technique in Sect. 2. In Sect. 3 we introduce our method. Section 4 discusses the experimental results on five image datasets. Section 5 concludes our work.

2 UMAP Embedding Technique for Dimensionality Reduction

Uniform Manifold Approximation and Projection (UMAP) is a recently proposed manifold learning method that seeks to accurately represent local structure while better incorporating global structure [9]. Compared to t-SNE it has a number of advantages; in particular, UMAP has been shown to scale well to large datasets, while t-SNE typically struggles with them. UMAP relies on three assumptions, namely that 1) the data is uniformly distributed on a Riemannian manifold, 2) the Riemannian metric is locally constant, and 3) the manifold is locally connected. Under these assumptions, the manifold can be represented by a fuzzy topological structure built from the high-dimensional data points, and the embedding is found by searching for a low-dimensional projection of the data whose fuzzy topological structure is as close as possible to it. To construct this fuzzy topological structure, UMAP represents the data points as a weighted high-dimensional graph, with edge weights representing the likelihood that two points are connected. UMAP uses an exponential probability distribution to compute the similarity between high-dimensional data points:

$$\begin{aligned} p_{i|j}=\exp (-\frac{d(x_i,x_j)-\rho _i}{\sigma _i}) \end{aligned}$$
(1)

Where \(d(x_{i},x_{j})\) is the distance between the i-th and j-th data points, \(\rho _i\) is the distance between the i-th data point and its first nearest neighbor, and \(\sigma _i\) is a point-specific normalization factor. Since the weight of the edge between nodes i and j is, in general, not equal to the weight between nodes j and i, UMAP uses a symmetrization of the high-dimensional probabilities:

$$\begin{aligned} p_{ij}=p_{i|j}+p_{j|i}-p_{i|j}p_{j|i} \end{aligned}$$
(2)

As said above, the constructed graph is a likelihood graph, and UMAP requires the number of nearest neighbors k to be specified; for each point i, the normalization factor \(\sigma _i\) is chosen so that the similarities to its neighbors satisfy:

$$\begin{aligned} k=2^{\sum _j{p_{i|j}}} \end{aligned}$$
(3)
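To make Eqs. (1)-(3) concrete, the following toy sketch computes the symmetrized high-dimensional similarities for a small dataset. It is our own simplified illustration: it uses all pairwise distances instead of restricting to the k nearest neighbors as the actual UMAP implementation does, and the binary search for each \(\sigma _i\) is kept deliberately simple.

```python
import numpy as np
from scipy.spatial.distance import cdist

def high_dim_similarities(X, k=15, n_iter=64):
    """Toy version of UMAP's high-dimensional fuzzy graph (Eqs. 1-3)."""
    D = cdist(X, X)                      # pairwise distances d(x_i, x_j)
    np.fill_diagonal(D, np.inf)          # ignore self-distances
    rho = D.min(axis=1)                  # distance to the first nearest neighbor
    P = np.zeros_like(D)
    for i in range(len(X)):
        lo, hi = 1e-6, 1e3
        for _ in range(n_iter):          # binary search for sigma_i such that
            sigma = (lo + hi) / 2.0      # the similarities sum to log2(k)  (Eq. 3)
            p = np.exp(-(D[i] - rho[i]) / sigma)
            p[i] = 0.0
            if p.sum() > np.log2(k):
                hi = sigma
            else:
                lo = sigma
        P[i] = p
    return P + P.T - P * P.T             # symmetrization (Eq. 2)
```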

Once the high-dimensional graph is constructed, UMAP builds and optimizes the layout of a low-dimensional analogue so that it is as similar as possible to the original graph. To model distances in low dimensions, UMAP uses a probability measure similar to the Student's t-distribution:

$$\begin{aligned} q_{ij}=(1+a\Vert y_i-y_j\Vert ^{2b})^{-1} \end{aligned}$$
(4)

where \(a\approx 1.93\) and \(b\approx 0.79\) for UMAP's default settings.

UMAP uses binary cross-entropy (CE) as a cost function due to its capability of capturing the global data structure:

$$\begin{aligned} CE(P, Q)=\sum _i{\sum _j[p_{ij}\log (\frac{p_{ij}}{q_{ij}})+(1-p_{ij})\log (\frac{1-p_{ij}}{1-q_{ij}})]} \end{aligned}$$
(5)

Where P denotes the probabilistic similarities of the high-dimensional data points, and Q those of the low-dimensional points.
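As an illustration of Eqs. (4) and (5), here is a short sketch of the low-dimensional similarities and the cross-entropy cost; the a and b values are the defaults quoted above, and the epsilon clamp is our own addition to keep the logarithms finite.

```python
import numpy as np
from scipy.spatial.distance import cdist

def low_dim_similarities(Y, a=1.93, b=0.79):
    """Eq. (4): Student-t-like similarities between embedded points y_i."""
    d2 = cdist(Y, Y, metric="sqeuclidean")   # squared distances ||y_i - y_j||^2
    return 1.0 / (1.0 + a * d2 ** b)

def cross_entropy(P, Q, eps=1e-12):
    """Eq. (5): fuzzy cross-entropy between the two similarity graphs."""
    P = np.clip(P, eps, 1.0 - eps)
    Q = np.clip(Q, eps, 1.0 - eps)
    return np.sum(P * np.log(P / Q) + (1.0 - P) * np.log((1.0 - P) / (1.0 - Q)))
```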

The derivative of the cross-entropy is used to update the coordinates of the low-dimensional data points, optimizing the projection until convergence. UMAP applies Stochastic Gradient Descent (SGD) because it converges faster and reduces memory consumption, since the gradients are computed on a subset of the data at each step.

UMAP has a number of important hyper-parameters that influence its performance (a minimal usage sketch is given after the list). These hyper-parameters are:

  • The dimensionality of the target embedding

  • The number of neighbors k: choosing a small value means the interpretation will be very local and will capture fine-detail structure, while choosing a large value means the estimation will be based on larger regions and will therefore miss some of the fine-detail structure.

  • The minimum allowed distance between points in the embedding space: lower values capture the true manifold structure more accurately, but may lead to dense clouds that make visualization difficult.
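A minimal usage sketch of these hyper-parameters with the umap-learn package (the parameters n_components, n_neighbors and min_dist correspond to the three items above); the dataset and values shown are illustrative, not the settings used in our experiments.

```python
import umap
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# n_components: dimensionality of the target embedding
# n_neighbors:  the number of neighbors k (local detail vs. broader structure)
# min_dist:     minimum allowed distance between points in the embedding space
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
embedding = reducer.fit_transform(X)   # shape: (n_samples, 2)
```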

3 Our Method

Our method relies primarily on applying clustering algorithms to the embedding manifold extracted by the manifold learning method UMAP [9], chosen for its success in preserving both the local and the global structure. We selected four well-known clustering algorithms: k-means [6], HDBSCAN [1], GMM [14] and Agglomerative Clustering [15]. We will show that by augmenting the clustering task with a manifold learning technique that explicitly takes local structure into account, we can increase the clustering performance of these different algorithms. Figure 1 represents the architecture of our method.

Fig. 1.

The structure of our method.
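For concreteness, the following is a minimal sketch of this pipeline, assuming the umap-learn, hdbscan and scikit-learn packages; the dataset and hyper-parameter values are illustrative only and do not reproduce the exact experimental settings of Sect. 4.

```python
import umap
import hdbscan
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, y = load_digits(return_X_y=True)
n_clusters = 10

# Step 1: embed the raw data on a low-dimensional manifold with UMAP
# (min_dist=0 packs the points densely, which helps the clustering step).
Z = umap.UMAP(n_components=10, n_neighbors=30, min_dist=0.0).fit_transform(X)

# Step 2: run each clustering algorithm on the embedded points Z
predictions = {
    "k-means": KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z),
    "GMM": GaussianMixture(n_components=n_clusters).fit_predict(Z),
    "Agglomerative": AgglomerativeClustering(n_clusters=n_clusters).fit_predict(Z),
    "HDBSCAN": hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(Z),
}
```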

4 Experiments

To assess the improvement of using UMAP with the clustering algorithms studied, we conduct experiments on a range of diverse datasets, including standard datasets widely used to evaluate clustering algorithms.

4.1 Datasets

We conducted our experiments on five diverse image datasets, including standard datasets used to evaluate deep clustering algorithms. Those datasets are MNIST [2], Fashion MNIST [3], USPS [13], Pen Digits [4] and UMIST Face Cropped [12]. Table 1 summarizes the main characteristics of each dataset.

Table 1. Datasets statistics.

4.2 Evaluation Metrics

In order to validate the performance of the unsupervised clustering algorithms, we use two standard evaluation metrics: clustering accuracy (ACC) and Normalized Mutual Information (NMI).

$$\begin{aligned} ACC = max_m \frac{\sum _{i=1}^{n} 1\{y_i = m(c_i)\}}{n} \end{aligned}$$
(6)
$$\begin{aligned} NMI =\frac{2I(y,c)}{[H(y)+H(c)]} \end{aligned}$$
(7)

where \(y_i\) and \(c_i\) are the ground-truth label and the cluster assignment of the i-th sample, m ranges over all one-to-one mappings between clusters and labels, \(I(\cdot ,\cdot )\) denotes the mutual information and \(H(\cdot )\) the entropy.
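Both metrics can be computed as in the sketch below: the best mapping m in Eq. (6) is found with the Hungarian algorithm, and NMI is taken from scikit-learn, whose default arithmetic normalization matches Eq. (7).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Eq. (6): accuracy under the best one-to-one cluster-to-label mapping."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    y_pred = y_pred - y_pred.min()       # shift noise labels (e.g. HDBSCAN's -1)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                  # contingency table: cluster p vs. label t
    rows, cols = linear_sum_assignment(cost, maximize=True)   # Hungarian matching
    return cost[rows, cols].sum() / len(y_true)

def nmi(y_true, y_pred):
    """Eq. (7): normalized mutual information."""
    return normalized_mutual_info_score(y_true, y_pred)
```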

4.3 Results

Figure 2 shows the resulting clusters when using k-means, for visualization purposes. We can see that the visualization is better when the algorithm is applied to the UMAP-embedded manifold of the five datasets. However, in order to better understand the effectiveness of our method, we study each clustering algorithm by measuring its results on the different datasets using accuracy and NMI, both on the raw data and on the features extracted by UMAP.

Fig. 2.

Visualization of K-Means applied to all five datasets. The first row shows K-Means applied to the raw datasets, and the second row shows K-Means applied to the UMAP-embedded manifold of these datasets.

Table 2 and Table 3 show the accuracy and NMI results for the clustering algorithms on the five datasets, compared with the same algorithms applied to the embedding manifold extracted by UMAP. In both tables, the improvement-score rows give the difference between the results of the algorithms on the raw data and their results on the features extracted by UMAP. This shows clearly how much UMAP helps the four clustering algorithms and to what extent the results improve. The algorithms achieve strong results on the embedded data points: accuracy improves by up to 60 percentage points, and NMI by 5 to 48 points. What is striking is how UMAP helps HDBSCAN improve its result by 60 percentage points on the USPS dataset; HDBSCAN also shows the largest improvement of all algorithms on 2 of the 5 datasets, gaining at least 50 points in accuracy and more than 38 points in NMI. GMM shows the largest improvement on 3 of the 5 datasets, gaining more than 34 points in accuracy and more than 25 points in NMI.

Table 2. Comparison between the different clustering algorithms on the five datasets according to the accuracy measure.
Table 3. Comparison between the different clustering algorithms on the five datasets according to the NMI measure

The accuracy and NMI measures show that the studied clustering algorithms in general, and HDBSCAN in particular, perform poorly on the raw data, especially on the MNIST and Fashion MNIST datasets. The problem is that all these clustering algorithms tend to suffer from the curse of dimensionality: high-dimensional data requires more observed samples before density becomes apparent. If we can reduce the dimensionality of the data, the density structure becomes more evident and it becomes far easier for these algorithms to cluster the data. What is needed is strong manifold learning, and this is where UMAP comes into play. One choice that helps the studied algorithms perform well on the learned manifold is to set the minimum-distance hyper-parameter of UMAP to 0, which packs the points together densely and produces cleaner separations between clusters.

Table 4. The execution time before and after applying UMAP on the different clustering algorithms on the five datasets.

Table 4 gives the execution time of each clustering algorithm on the different datasets, compared to the run-time of the same algorithms applied to the embedding manifold of the five datasets. We observe that the run-time is also greatly improved: it is reduced to a few seconds, and sometimes to fractions of a second, which is a strong result given the size of the datasets. The effect is especially pronounced for the agglomerative and HDBSCAN algorithms; the run-time of HDBSCAN drops from over 26 minutes to around 5 seconds on the MNIST and Fashion MNIST datasets. These results indicate that, with this method, the studied clustering algorithms can handle large datasets well.

5 Conclusion

In this paper, we investigated the use of the UMAP technique for dimensionality reduction before applying a number of well-known clustering algorithms. We showed that it can drastically improve the performance of the studied algorithms, both in terms of clustering quality and of run-time. The experimental results indicate that the proposed approach clearly improves clustering performance and makes the mentioned clustering algorithms competitive with current state-of-the-art clustering approaches. The experiments also show that our method allows the clustering algorithms considered to deal better with larger datasets.