Elsevier

Pattern Recognition

Volume 83, November 2018, Pages 375-387

A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

https://doi.org/10.1016/j.patcog.2018.05.030

Highlights

  • The underlying idea is: point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.

  • NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, whereas ρ-Approximate DBSCAN is an approximate algorithm.

  • The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n log(n)) provided the parameters are properly chosen, whereas ρ-Approximate DBSCAN runs in O(n²) in high dimension.

  • NQ-DBSCAN is suitable for clustering data with a lot of noise.

Abstract

Clustering is an important technique for dealing with the large-scale data that are explosively created on the Internet. Most such data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time but needs the dimension D to be a relatively small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimension, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN; the resulting algorithm, named NQ-DBSCAN, effectively avoids a large number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n·log(n)) time on average with the help of an indexing technique, and in O(n) time in the best case if proper parameters are used, which makes it suitable for many realtime data.

Introduction

Nowadays, large collections of data are explosively created in different fields, and most of these data are high-dimensional with a lot of noise, e.g. Web texts and Web videos, some of which have more than 10,000 dimensions; this brings great challenges to retrieval, classification and understanding. Much research has been conducted in this area to deal with this kind of data [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].

Data clustering is one of the most important and popular data analysis techniques to understand data. It refers to the process of grouping objects into meaningful subclasses (clusters) so that members of a cluster are as similar as possible whereas members of different clusters differ as much as possible [14], [15], [16]. Numerous clustering algorithms have been used in many areas such as image processing [17], [18], [19], geophysics [20], [21], customer and marketing analysis [22], [23], crime detection [24], medicine [25], [26] and agriculture [27]. Innovative clustering methods [28], [29], [30] and parallel implementation frameworks [31], [32] have been proposed.

Clustering algorithms can be roughly categorized into partitioning, hierarchical, grid-based and density-based approaches, etc. The density-based approach is one of the most popular paradigms, and its most famous algorithm is DBSCAN [33], which is designed to discover clusters of arbitrary shape given a fixed scanning radius ϵ (eps) and a density threshold MinPts. DBSCAN has a large number of extensions, e.g. [34], [35], [36], [37], and has been widely applied in many fields, such as astronomy [38] and neuroscience [39]. However, DBSCAN has the following drawbacks.
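For reference, the classic algorithm can be sketched in a few lines; the following is a minimal, brute-force Python rendering of DBSCAN as described in [33] (not the authors' implementation), with an O(n) linear scan per neighborhood query:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels >= 0 are clusters, -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def region_query(i):
        # Brute-force eps-neighborhood: O(n) per query, O(n^2) overall.
        d = np.linalg.norm(X - X[i], axis=1)
        return np.flatnonzero(d <= eps)

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = region_query(i)
        if len(seeds) < min_pts:
            continue                    # i is not a core point here
        labels[i] = cluster
        seeds = list(seeds)
        while seeds:                    # expand the cluster from its seeds
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border or newly reached point
            if not visited[j]:
                visited[j] = True
                nbrs = region_query(j)
                if len(nbrs) >= min_pts:
                    seeds.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Tiny usage example: two tight blobs plus one isolated noise point.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
              [10.0, 10.0]])
labels = dbscan(X, eps=0.5, min_pts=3)
```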

(1) It is rendered almost useless on high-dimensional data due to the so-called “curse of dimensionality”.

(2) The running time of DBSCAN is heavily dominated by finding the neighbors, i.e. obtaining the density, of each data point. Without indexing, the complexity of DBSCAN is always O(n²) regardless of the parameters ϵ and MinPts. If a tree-based spatial index is used and the ϵ-neighborhoods are expected to be small compared to the size of the whole data space, the average complexity is reduced to O(n·log(n)) [33]. However, for dimension d ≥ 3 the DBSCAN problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs are made in theoretical computer science [40].

Many researchers have proposed various techniques in attempts to improve the performance of clustering algorithms on high-dimensional data. For example, Wang and Deng developed a series of important works on soft subspace clustering and fuzzy clustering for high-dimensional data [41], [42], [43], [44], which overcome the drawback of utilizing only one distance function, common in most existing clustering algorithms, by adaptively learning distance functions suited to the data set during the clustering process.

Grid-based and approximation techniques are also popular, such as Fast-DBSCAN [45] and others [46], [47]. Grid-based techniques, e.g. [48], [49], [50], [51], divide the data space with grids, perform clustering locally in each cell, and merge the results, thereby saving runtime. Gunawan [45] proposed Fast-DBSCAN based on drawing a 2-dimensional grid. The algorithm imposes an arbitrary grid T on the data space R², where each cell of T has side length ϵ/2. If a non-empty cell c contains at least MinPts points, then all of those points must be core points, because the distance between any two points within the cell is at most ϵ. This algorithm theoretically runs in O(n·log(n)) time in the worst case. However, it applies only to 2-dimensional data.
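The core-cell observation above can be sketched as follows (a hypothetical `core_cells` helper illustrating the idea, not Gunawan's code): bin 2-D points into cells of side ϵ/2, so the diagonal of a cell is ϵ/√2 ≤ ϵ and any cell with at least MinPts points contains only core points.

```python
import math
from collections import defaultdict

def core_cells(points, eps, min_pts):
    # Bin 2-D points into a grid with side length eps/2, as in Fast-DBSCAN.
    # Two points in the same cell are at most (eps/2)*sqrt(2) = eps/sqrt(2)
    # apart, so a cell with >= min_pts points contains only core points.
    side = eps / 2.0
    grid = defaultdict(list)
    for x, y in points:
        grid[(math.floor(x / side), math.floor(y / side))].append((x, y))
    return {cell for cell, pts in grid.items() if len(pts) >= min_pts}

# Four points packed into one cell, one stray point far away.
pts = [(0.01, 0.01), (0.02, 0.03), (0.03, 0.02), (0.04, 0.04), (9.0, 9.0)]
dense = core_cells(pts, eps=0.2, min_pts=4)
```

The points in `dense` cells need no neighborhood query at all to be declared core, which is where the runtime saving comes from.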

Inspired by Fast-DBSCAN, Gan and Tao [40] proposed a novel algorithm named ρ-approximate DBSCAN, whose computation time scales only linearly in n. Its improvements over Gunawan's method [45] lie in a new tree structure, a quadtree-like hierarchical grid, together with the sacrifice of a small amount of accuracy. Because the number of cells in the quadtree-like hierarchical grid T explodes with the dimension D, ρ-approximate DBSCAN stores only the non-empty cells. However, it needs the dimension D to be a relatively small constant for the linear running time to hold, and it actually still runs in O(n²) time in high dimension, as the following theorem shows.
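Storing only the non-empty cells can be sketched with a plain hash map (a simplified stand-in for the quadtree-like grid, with the conventional side length ϵ/√D so a cell's diagonal is ϵ; `nonempty_cells` is a hypothetical helper name):

```python
import math
from collections import defaultdict

def nonempty_cells(points, eps):
    # Hash every D-dimensional point to its grid cell of side eps/sqrt(D).
    # Only cells that actually receive a point are stored: the full grid
    # would have far more cells than points once D is large.
    D = len(points[0])
    side = eps / math.sqrt(D)
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / side) for c in p)].append(i)
    return grid

# Two nearby 8-dimensional points share a cell; a distant one gets its own.
grid = nonempty_cells([[0.0] * 8, [0.01] * 8, [3.0] * 8], eps=1.0)
```

Memory is proportional to the number of occupied cells (here 2), not to the astronomically large total cell count.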

Theorem 1

ρ-approximate DBSCAN degenerates to an O(n²) algorithm if 2^D ≫ n.

Proof

Let X be the maximum radius for DBSCAN to correctly cluster the data set P, and let the dimension D be large enough that 2^D ≫ n, which implies that there are many more cells than n in the grid. Set ϵ = X. Since the side length of each cell is X/√D and lim_{D→∞} X/√D = 0, each cell contains at most one point if D is large enough.

In the case of 2^D ≪ n, ρ-approximate DBSCAN answers any approximate range-count query in O(1) expected time (see Lemma 5 in [40]). But here, since each non-empty cell contains at most one point, about n non-empty cells are stored. Thus the query time for each point to find its neighbors is O(n), not O(1) any more, and hence ρ-approximate DBSCAN runs in O(n²) expected time. □
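The effect used in the proof is easy to check numerically (a small simulation on assumed uniform data, not the paper's experiment): with cell side ϵ/√D, points at unit scale collide heavily in low dimension but land in pairwise-distinct cells once D is large, so an ϵ-query must touch on the order of n cells.

```python
import math
import random

def occupied_cell_count(n, D, eps, seed=0):
    # Drop n uniform points in [0,1)^D onto a grid of side eps/sqrt(D)
    # and count how many distinct cells they occupy.
    rng = random.Random(seed)
    side = eps / math.sqrt(D)
    cells = {tuple(math.floor(rng.random() / side) for _ in range(D))
             for _ in range(n)}
    return len(cells)

low = occupied_cell_count(n=100, D=2, eps=1.0)      # at most 4 cells exist
high = occupied_cell_count(n=100, D=1000, eps=1.0)  # one point per cell
```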

Therefore, most existing clustering algorithms are not suitable for many realtime applications, due to the “curse of dimensionality”. The main reason lies in the great number of unnecessary distance calculations, which can be greatly reduced by neighbor-searching techniques such as product quantization for nearest neighbor search [52], LSH (Locality-Sensitive Hashing) [53] and FLANN [54].

In this paper, we propose a new clustering approach, named NQ-DBSCAN, that uses a local neighbor-query technique and a quadtree-like hierarchical grid to eliminate a great number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n·log(n)) expected time on average with the help of an indexing technique, and in O(n) time in the best case if proper parameters are used, which makes it suitable for many realtime data.

Because ρ-Approximate DBSCAN is currently the most important improvement of DBSCAN, we focus only on DBSCAN, ρ-Approximate DBSCAN and NQ-DBSCAN in this paper. NQ-DBSCAN has the following advantages over ρ-Approximate DBSCAN.

(1) NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, whereas ρ-Approximate DBSCAN is an approximate algorithm.

(2) The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n·log(n)) provided the parameters are properly chosen, whereas ρ-Approximate DBSCAN runs in O(n²) in high dimension.

(3) NQ-DBSCAN is suitable for clustering data with a lot of noise.

The rest of this paper is organized as follows: Section 2 introduces the basic concepts; Section 3 presents the details of the proposed clustering algorithm; Section 4 demonstrates the experimental results of the proposed algorithms on various data sets; and Section 5 gives the conclusion and our future work.

Section snippets

Basic concepts

Density-based clustering algorithms have the ability to find clusters of different shapes and sizes. DBSCAN, a pioneering density-based clustering algorithm, is one of the most important and popular clustering algorithms in the scientific literature.1 DBSCAN accepts two parameters: ϵ (Eps) and MinPts, where ϵ is the scanning radius and MinPts is the minimal number of neighbor points for a core point. Some concepts and terms to explain the DBSCAN algorithm can

The proposed algorithm

We propose a new algorithm to improve DBSCAN by filtering a large number of unnecessary density computations, which is based on the following idea.

Point p and point q should have similar neighbors, provided p and q are close; given a certain ϵ, the closer they are, the more similar their neighbors are. As Fig. 1 shows, points p and q share more neighbors in Fig. 1(a) than in Fig. 1(b). Formally, we have some theorems which are important for validating the
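This intuition can be made concrete with the triangle inequality (a sketch with a hypothetical helper, not necessarily the paper's exact pruning rule): with δ = dist(p, q), every x with dist(p, x) ≤ ϵ − δ is certainly an ϵ-neighbor of q, every x with dist(p, x) > ϵ + δ certainly is not, and only the ring in between needs an actual distance computation against q.

```python
import numpy as np

def shared_neighbor_bounds(X, p, q, eps):
    """Classify all points relative to q's eps-ball using only the
    already-computed distances from p (triangle-inequality pruning)."""
    dp = np.linalg.norm(X - X[p], axis=1)   # distances from p to all points
    delta = np.linalg.norm(X[p] - X[q])     # distance between p and q
    sure_in = np.flatnonzero(dp <= eps - delta)              # surely neighbors of q
    to_check = np.flatnonzero((dp > eps - delta) &
                              (dp <= eps + delta))           # need a real check
    sure_out = np.flatnonzero(dp > eps + delta)              # surely not neighbors
    return sure_in, to_check, sure_out

# Usage on random data: only the "ring" around q needs distance computations.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
eps = 1.0
sure_in, to_check, sure_out = shared_neighbor_bounds(X, 0, 1, eps)
```

The closer p and q are (the smaller δ), the thinner the ring, so fewer distances must actually be computed for q.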

Experiments

In this section, we conduct experiments to evaluate the performance of NQ-DBSCAN, and make comparisons with original DBSCAN and ρ-approximate DBSCAN [40], on synthetic and realtime data sets.

Conclusion

Today, large collections of data are explosively created in different fields, and most of these data are high-dimensional with a lot of noise, which brings great challenges to clustering. DBSCAN is a creative and elegant technique for density-based clustering. However, it is rendered almost useless for high-dimensional data due to the “curse of dimensionality”, which limits its applicability in many realtime applications. ρ-approximate DBSCAN [40] is an efficient approach designed to replace

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Nos. 61673186, 71771094); the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (No. 201700002); the Open Project Program of the State Key Lab of CAD&CG (Grant No. A1722), Zhejiang University; the Natural Science Foundation of Fujian Province (No. 2016J01303); the Project of Science and Technology Plan of Fujian Province of China (No. 2017H01010065); and the Graduate Students Research and Innovation Ability

Yewang Chen received the B.S. and M.S. degrees in information management and computer science from Huaqiao University, Xiamen, China, in 2001 and 2006, respectively, and the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2009. Currently, he is a Lecturer in the School of Computer Science and Technology, Huaqiao University, Xiamen, China. His current research interests include natural language processing, machine learning and pattern recognition.

References (63)

  • A. Hatamlou, Black hole: a new heuristic optimization approach for data clustering, Inf. Sci. (2013)
  • J. Wang et al., Distance metric learning for soft subspace clustering in composite kernel space, Pattern Recognit. (2016)
  • Z. Deng et al., Enhanced soft subspace clustering integrating within-cluster and between-cluster information, Pattern Recognit. (2010)
  • P. Viswanath et al., Rough-DBSCAN: a fast hybrid density based clustering method for large data sets, Pattern Recognit. Lett. (2009)
  • D. Birant et al., ST-DBSCAN: an algorithm for clustering spatial–temporal data, Data Knowl. Eng. (2007)
  • Y. Chen et al., Decentralized clustering by finding loose and distributed density cores, Inf. Sci. (2018)
  • J. Song et al., Optimized graph learning using partial tags and multiple features for image and video annotation, IEEE Trans. Image Process. (2016)
  • J. Song et al., Deep and fast: deep learning hashing with semi-supervised graph construction, Image Vision Comput. (2016)
  • J. Song et al., A distance-computation-free search scheme for binary code databases, IEEE Trans. Multimedia (2016)
  • W. Zhou et al., Scalable feature matching by dual cascaded scalar quantization for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • S. Zhang et al., Affective visualization and retrieval for music video, IEEE Trans. Multimedia (2010)
  • A.K. Rajagopal et al., Exploring transfer learning approaches for head pose classification from multi-view surveillance images, Int. J. Comput. Vision (2014)
  • Y. Yan et al., A multi-task learning framework for head pose estimation under target motion, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • B.F. Qaqish et al., Accelerating high dimensional clustering with lossless data reduction, Bioinformatics (2017)
  • O. Limwattanapibool et al., Determination of the appropriate parameters for k-means clustering using selection of region clusters based on density DBSCAN (SRCD-DBSCAN), Expert Syst. (2017)
  • M. Ester et al., Algorithms and applications for spatial data mining, Geographic Data Min. Knowl. Discovery (2001)
  • W.A. Barbakh et al., Non-Standard Parameter Adaptation for Exploratory Data Analysis (2009)
  • J. Han et al., Data Mining: Concepts and Techniques (2011)
  • Y.C. Song et al., The application of cluster analysis in geophysical data interpretation, Comput. Geosci. (2010)
  • J. Li et al., Chameleon based on clustering feature tree and its application in customer segmentation, Ann. Oper. Res. (2009)
  • C.-W. Huang et al., Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image, Soft Comput. (2015)


Shenyu Tang received the B.S. degree from the College of Mathematics of Huaqiao University, Quanzhou, China, in 2012, and he is a postgraduate student in the School of Computer Science and Technology, Huaqiao University, Xiamen, China. His current research interests are machine learning and pattern recognition.

Nizar Bouguila received the engineer degree from the University of Tunis in 2000, and the M.Sc. and Ph.D. degrees from Sherbrooke University in 2002 and 2006, respectively, all in computer science. He is currently a professor within the Concordia Institute for Information Systems Engineering (CIISE) at Concordia University, Montreal, QC, Canada. His current research interests include computer vision and pattern recognition, machine learning and data mining, image and signal processing, statistical process control, and 3D graphics and games.

Cheng Wang received the B.S. degree in software engineering from Xidian University, Xi'an, China, in 2006, and the Ph.D. degree in mechanics from Xi'an Jiaotong University, China, in 2012. Currently, he is an associate professor in the School of Computer Science and Technology, Huaqiao University, Xiamen, China. His research interests include signal processing and data mining.

Jixiang Du received the B.Sc. degree in Vehicle Engineering from Hefei University of Technology in July 1999, and the M.Sc. degree in Vehicle Engineering from Hefei University of Technology in July 2002. From February 2003 he pursued the Ph.D. degree in Pattern Recognition and Intelligent Systems at the University of Science and Technology of China (USTC), Hefei, China, receiving it in December 2005. He is now a professor at the College of Computer Science and Technology at Huaqiao University.

Hailin Li received the B.S. degree in Information and Computing Science from Jingdezhen Ceramic University, Jingdezhen, China, in 2006, and the Ph.D. degree in Management Science and Engineering from Dalian University of Technology, China, in 2012. Currently, he is an associate professor in the School of Business Administration, Huaqiao University, Quanzhou, China. His research interests include data mining and decision making.
