Elsevier

Pattern Recognition

Volume 83, November 2018, Pages 375-387

A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

https://doi.org/10.1016/j.patcog.2018.05.030

Highlights

  • The underlying idea is: point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.

  • NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, whereas ρ-Approximate DBSCAN is an approximate algorithm.

  • The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n log(n)) provided the parameters are properly chosen, whereas ρ-Approximate DBSCAN runs in O(n²) in high dimension.

  • NQ-DBSCAN is suitable for clustering data with a lot of noise.

Abstract

Clustering is an important technique for dealing with the large-scale data that are explosively created on the Internet. Most such data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time but needs the dimension D to be a relatively small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimension, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN; the resulting algorithm, named NQ-DBSCAN, effectively avoids a large number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n·log(n)) time on average with the help of an indexing technique, and in O(n) time in the best case if proper parameters are used, which makes it suitable for many realtime data.

Introduction

Nowadays, large collections of data are explosively created in different fields, and most of these data are high-dimensional with a lot of noise, e.g. Web texts and Web videos, some of which have more than 10,000 dimensions; this brings great challenges to retrieval, classification and understanding. Much research has been conducted in this area to deal with this kind of data [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13].

Data clustering is one of the most important and popular data analysis techniques to understand data. It refers to the process of grouping objects into meaningful subclasses (clusters) so that members of a cluster are as similar as possible whereas members of different clusters differ as much as possible [14], [15], [16]. Numerous clustering algorithms have been used in many areas such as image processing [17], [18], [19], geophysics [20], [21], customer and marketing analysis [22], [23], crime detection [24], medicine [25], [26] and agriculture [27]. Innovative clustering methods [28], [29], [30] and parallel implementation frameworks [31], [32] have been proposed.

Clustering algorithms can be roughly categorized into partitioning, hierarchical, grid-based and density-based approaches, etc. The density-based approach is one of the most popular paradigms, and its most famous algorithm is DBSCAN [33], which is designed to discover clusters of arbitrary shape given a fixed scanning radius ϵ (eps) and a density threshold MinPts. DBSCAN has a large number of extensions, e.g. [34], [35], [36], [37], and has been widely applied in many fields, such as astronomy [38] and neuroscience [39]. However, DBSCAN has the following drawbacks.
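For reference, the classic algorithm can be sketched in a few lines; the following is a minimal, brute-force Python rendering of DBSCAN as described in [33] (not the authors' implementation), with an O(n) linear scan per neighborhood query:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels >= 0 are clusters, -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def region_query(i):
        # Brute-force eps-neighborhood: O(n) per query, O(n^2) overall.
        d = np.linalg.norm(X - X[i], axis=1)
        return np.flatnonzero(d <= eps)

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = region_query(i)
        if len(seeds) < min_pts:
            continue                    # i is not a core point here
        labels[i] = cluster
        seeds = list(seeds)
        while seeds:                    # expand the cluster from its seeds
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border or newly reached point
            if not visited[j]:
                visited[j] = True
                nbrs = region_query(j)
                if len(nbrs) >= min_pts:
                    seeds.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Tiny usage example: two tight blobs plus one isolated noise point.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
              [10.0, 10.0]])
labels = dbscan(X, eps=0.5, min_pts=3)
```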

(1) It is rendered almost useless on high-dimensional data due to the so-called “curse of dimensionality”.

(2) The running time of DBSCAN is heavily dominated by finding the neighbors, i.e. obtaining the density, of each data point. Without indexing, the complexity of DBSCAN is always O(n²) regardless of the parameters ϵ and MinPts. If a tree-based spatial index is used and the ϵ-neighborhoods are expected to be small compared to the size of the whole data space, the average complexity is reduced to O(n·log(n)) [33]. However, for dimension d ≥ 3 the DBSCAN problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs are made in theoretical computer science [40].

Many researchers have proposed various techniques in attempts to improve the performance of clustering algorithms on high-dimensional data. For example, Wang and Deng developed a series of important works on soft subspace clustering and fuzzy clustering for high-dimensional data [41], [42], [43], [44], which overcome the drawback of utilizing only one distance function, common in most existing clustering algorithms, by adaptively learning distance functions suited to the data set during the clustering process.

Grid-based and approximation techniques are also popular, such as Fast-DBSCAN [45] and others [46], [47]. Grid-based techniques, e.g. [48], [49], [50], [51], divide the data space with grids, perform clustering locally in each cell, and merge the results, thereby saving runtime. Gunawan [45] proposed Fast-DBSCAN based on drawing a 2-dimensional grid. The algorithm imposes an arbitrary grid T on the data space R², where each cell of T has side length ϵ/2. If a non-empty cell c contains at least MinPts points, then all of those points must be core points, because the distance between any two points within the cell is at most ϵ. This algorithm theoretically runs in O(n·log(n)) time in the worst case. However, it applies only to 2-dimensional data.
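The core-cell observation above can be sketched as follows (a hypothetical `core_cells` helper illustrating the idea, not Gunawan's code): bin 2-D points into cells of side ϵ/2, so the diagonal of a cell is ϵ/√2 ≤ ϵ and any cell with at least MinPts points contains only core points.

```python
import math
from collections import defaultdict

def core_cells(points, eps, min_pts):
    # Bin 2-D points into a grid with side length eps/2, as in Fast-DBSCAN.
    # Two points in the same cell are at most (eps/2)*sqrt(2) = eps/sqrt(2)
    # apart, so a cell with >= min_pts points contains only core points.
    side = eps / 2.0
    grid = defaultdict(list)
    for x, y in points:
        grid[(math.floor(x / side), math.floor(y / side))].append((x, y))
    return {cell for cell, pts in grid.items() if len(pts) >= min_pts}

# Four points packed into one cell, one stray point far away.
pts = [(0.01, 0.01), (0.02, 0.03), (0.03, 0.02), (0.04, 0.04), (9.0, 9.0)]
dense = core_cells(pts, eps=0.2, min_pts=4)
```

The points in `dense` cells need no neighborhood query at all to be declared core, which is where the runtime saving comes from.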

Inspired by Fast-DBSCAN, Gan and Tao [40] proposed a novel algorithm named ρ-approximate DBSCAN, whose computation time scales only linearly in n. Its improvements over Gunawan's method [45] lie in a new tree structure, a quadtree-like hierarchical grid, together with the sacrifice of a small amount of accuracy. Because the number of cells in the quadtree-like hierarchical grid T explodes with the dimension D, ρ-approximate DBSCAN stores only the non-empty cells. However, it needs the dimension D to be a relatively small constant for the linear running time to hold, and it actually still runs in O(n²) time in high dimension, as the following theorem shows.
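Storing only the non-empty cells can be sketched with a plain hash map (a simplified stand-in for the quadtree-like grid, with the conventional side length ϵ/√D so a cell's diagonal is ϵ; `nonempty_cells` is a hypothetical helper name):

```python
import math
from collections import defaultdict

def nonempty_cells(points, eps):
    # Hash every D-dimensional point to its grid cell of side eps/sqrt(D).
    # Only cells that actually receive a point are stored: the full grid
    # would have far more cells than points once D is large.
    D = len(points[0])
    side = eps / math.sqrt(D)
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / side) for c in p)].append(i)
    return grid

# Two nearby 8-dimensional points share a cell; a distant one gets its own.
grid = nonempty_cells([[0.0] * 8, [0.01] * 8, [3.0] * 8], eps=1.0)
```

Memory is proportional to the number of occupied cells (here 2), not to the astronomically large total cell count.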

Theorem 1

ρ-approximate DBSCAN degenerates to an O(n²) algorithm if 2^D ≫ n.

Proof

Let X be the maximum radius for DBSCAN to correctly cluster the data set P, and let the dimension D be large enough that 2^D ≫ n, which implies that there are many more cells than n in the grid. Set ϵ = X. Since the side length of each cell is X/√D and lim_{D→∞} X/√D = 0, each cell contains at most one point if D is large enough.

In the case of 2^D ≪ n, ρ-approximate DBSCAN answers any approximate range-count query in O(1) expected time (see Lemma 5 in [40]). But here, since each non-empty cell contains at most one point, about n non-empty cells are stored. Thus the query time for each point to find its neighbors is O(n), not O(1) any more, and hence ρ-approximate DBSCAN runs in O(n²) expected time. □
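The effect used in the proof is easy to check numerically (a small simulation on assumed uniform data, not the paper's experiment): with cell side ϵ/√D, points at unit scale collide heavily in low dimension but land in pairwise-distinct cells once D is large, so an ϵ-query must touch on the order of n cells.

```python
import math
import random

def occupied_cell_count(n, D, eps, seed=0):
    # Drop n uniform points in [0,1)^D onto a grid of side eps/sqrt(D)
    # and count how many distinct cells they occupy.
    rng = random.Random(seed)
    side = eps / math.sqrt(D)
    cells = {tuple(math.floor(rng.random() / side) for _ in range(D))
             for _ in range(n)}
    return len(cells)

low = occupied_cell_count(n=100, D=2, eps=1.0)      # at most 4 cells exist
high = occupied_cell_count(n=100, D=1000, eps=1.0)  # one point per cell
```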

Therefore, most existing clustering algorithms are not suitable for many realtime applications, due to the “curse of dimensionality”. The main reason lies in the great number of unnecessary distance calculations, which can be greatly reduced by neighbor-searching techniques such as product quantization for nearest neighbor search [52], LSH (Locality-Sensitive Hashing) [53] and FLANN [54].

In this paper, we propose a new clustering approach, named NQ-DBSCAN, that uses a local neighbor-query technique and a quadtree-like hierarchical grid to eliminate a great number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n·log(n)) expected time on average with the help of an indexing technique, and in O(n) time in the best case if proper parameters are used, which makes it suitable for many realtime data.

Because ρ-Approximate DBSCAN is currently the most important improvement of DBSCAN, we focus only on DBSCAN, ρ-Approximate DBSCAN and NQ-DBSCAN in this paper. NQ-DBSCAN has the following advantages over ρ-Approximate DBSCAN.

(1) NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, whereas ρ-Approximate DBSCAN is an approximate algorithm.

(2) The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n·log(n)) provided the parameters are properly chosen, whereas ρ-Approximate DBSCAN runs in O(n²) in high dimension.

(3) NQ-DBSCAN is suitable for clustering data with a lot of noise.

The rest of this paper is organized as follows: Section 2 introduces the basic concepts; Section 3 presents the details of the proposed clustering algorithm; Section 4 demonstrates the experimental results of the proposed algorithms on various data sets; and Section 5 gives the conclusion and our future work.

Section snippets

Basic concepts

Density-based clustering algorithms have the ability to find clusters of different shapes and sizes. DBSCAN, a pioneering density-based clustering algorithm, is one of the most important and popular clustering algorithms in the scientific literature.1 DBSCAN accepts two parameters: ϵ (Eps) and MinPts, where ϵ is the scanning radius and MinPts is the minimal number of neighbor points for a core point. Some concepts and terms to explain the DBSCAN algorithm can

The proposed algorithm

We propose a new algorithm to improve DBSCAN by filtering a large number of unnecessary density computations, which is based on the following idea.

Point p and point q should have similar neighbors, provided p and q are close; given a certain ϵ, the closer they are, the more similar their neighbors are. As Fig. 1 shows, points p and q share more neighbors in Fig. 1(a) than in Fig. 1(b). Formally, we have some theorems which are important for validating the
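This intuition can be made concrete with the triangle inequality (a sketch with a hypothetical helper, not necessarily the paper's exact pruning rule): with δ = dist(p, q), every x with dist(p, x) ≤ ϵ − δ is certainly an ϵ-neighbor of q, every x with dist(p, x) > ϵ + δ certainly is not, and only the ring in between needs an actual distance computation against q.

```python
import numpy as np

def shared_neighbor_bounds(X, p, q, eps):
    """Classify all points relative to q's eps-ball using only the
    already-computed distances from p (triangle-inequality pruning)."""
    dp = np.linalg.norm(X - X[p], axis=1)   # distances from p to all points
    delta = np.linalg.norm(X[p] - X[q])     # distance between p and q
    sure_in = np.flatnonzero(dp <= eps - delta)              # surely neighbors of q
    to_check = np.flatnonzero((dp > eps - delta) &
                              (dp <= eps + delta))           # need a real check
    sure_out = np.flatnonzero(dp > eps + delta)              # surely not neighbors
    return sure_in, to_check, sure_out

# Usage on random data: only the "ring" around q needs distance computations.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
eps = 1.0
sure_in, to_check, sure_out = shared_neighbor_bounds(X, 0, 1, eps)
```

The closer p and q are (the smaller δ), the thinner the ring, so fewer distances must actually be computed for q.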

Experiments

In this section, we conduct experiments to evaluate the performance of NQ-DBSCAN, and make comparisons with original DBSCAN and ρ-approximate DBSCAN [40], on synthetic and realtime data sets.

Conclusion

Today, large collections of data are explosively created in different fields, and most of these data are high-dimensional with a lot of noise, which brings great challenges to clustering. DBSCAN is a creative and elegant technique for density-based clustering. However, it is rendered almost useless for high-dimensional data due to the “curse of dimensionality”, which limits its applicability in many realtime applications. ρ-approximate DBSCAN [40] is an efficient approach designed to replace

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Nos. 61673186, 71771094); the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (No. 201700002); the Open Project Program of the State Key Lab of CAD&CG (Grant No. A1722), Zhejiang University; the Natural Science Foundation of Fujian Province (No. 2016J01303); the Project of Science and Technology Plan of Fujian Province of China (No. 2017H01010065); and the Graduate Students Research and Innovation Ability

Yewang Chen received the B.S. and M.S. degrees in information management and computer science from Huaqiao University, Xiamen, China, in 2001 and 2006, respectively, and the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2009. Currently, he is a Lecturer in the School of Computer Science and Technology, Huaqiao University, Xiamen, China. His current research interests include natural language processing, machine learning and pattern recognition.

References (63)

  • A. Hatamlou, Black hole: a new heuristic optimization approach for data clustering, Inf. Sci. (2013)
  • J. Wang et al., Distance metric learning for soft subspace clustering in composite kernel space, Pattern Recognit. (2016)
  • Z. Deng et al., Enhanced soft subspace clustering integrating within-cluster and between-cluster information, Pattern Recognit. (2010)
  • P. Viswanath et al., Rough-DBSCAN: a fast hybrid density based clustering method for large data sets, Pattern Recognit. Lett. (2009)
  • D. Birant et al., ST-DBSCAN: an algorithm for clustering spatial–temporal data, Data Knowl. Eng. (2007)
  • Y. Chen et al., Decentralized clustering by finding loose and distributed density cores, Inf. Sci. (2018)
  • J. Song et al., Optimized graph learning using partial tags and multiple features for image and video annotation, IEEE Trans. Image Process. (2016)
  • J. Song et al., Deep and fast: deep learning hashing with semi-supervised graph construction, Image Vision Comput. (2016)
  • J. Song et al., A distance-computation-free search scheme for binary code databases, IEEE Trans. Multimedia (2016)
  • W. Zhou et al., Scalable feature matching by dual cascaded scalar quantization for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • S. Zhang et al., Affective visualization and retrieval for music video, IEEE Trans. Multimedia (2010)
  • A.K. Rajagopal et al., Exploring transfer learning approaches for head pose classification from multi-view surveillance images, Int. J. Comput. Vision (2014)
  • Y. Yan et al., A multi-task learning framework for head pose estimation under target motion, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • B.F. Qaqish et al., Accelerating high dimensional clustering with lossless data reduction, Bioinformatics (2017)
  • O. Limwattanapibool et al., Determination of the appropriate parameters for k-means clustering using selection of region clusters based on density DBSCAN (SRCD-DBSCAN), Expert Syst. (2017)
  • M. Ester et al., Algorithms and applications for spatial data mining, Geographic Data Min. Knowl. Discovery (2001)
  • W.A. Barbakh et al., Non-Standard Parameter Adaptation for Exploratory Data Analysis (2009)
  • J. Han et al., Data Mining: Concepts and Techniques (2011)
  • Y.C. Song et al., The application of cluster analysis in geophysical data interpretation, Comput. Geosci. (2010)
  • J. Li et al., Chameleon based on clustering feature tree and its application in customer segmentation, Ann. Oper. Res. (2009)
  • C.-W. Huang et al., Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image, Soft Comput. (2015)


Shenyu Tang received the B.S. degree from the College of Mathematics of Huaqiao University, Quanzhou, China, in 2012, and he is a postgraduate student in the School of Computer Science and Technology, Huaqiao University, Xiamen, China. His current research interests are machine learning and pattern recognition.

Nizar Bouguila received the engineer degree from the University of Tunis in 2000, and the M.Sc. and Ph.D. degrees from Sherbrooke University in 2002 and 2006, respectively, all in computer science. He is currently a professor within the Concordia Institute for Information Systems Engineering (CIISE) at Concordia University, Montreal, QC, Canada. His current research interests include computer vision and pattern recognition, machine learning and data mining, image and signal processing, statistical process control, and 3D graphics and games.

Cheng Wang received the B.S. degree in software engineering from Xidian University, Xi'an, China, in 2006, and the Ph.D. degree in mechanics from Xi'an Jiaotong University, China, in 2012. Currently, he is an associate professor in the School of Computer Science and Technology, Huaqiao University, Xiamen, China. His research interests include signal processing and data mining.

Jixiang Du received the B.Sc. degree in Vehicle Engineering from Hefei University of Technology in July 1999, and the M.Sc. degree in Vehicle Engineering from Hefei University of Technology in July 2002. From February 2003 he pursued the Ph.D. degree in Pattern Recognition and Intelligent Systems at the University of Science and Technology of China (USTC), Hefei, China, receiving it in December 2005. He is now a professor at the College of Computer Science and Technology at Huaqiao University.

Hailin Li received the B.S. degree in Information and Computing Science from Jingdezhen Ceramic University, Jingdezhen, China, in 2006, and the Ph.D. degree in Management Science and Engineering from Dalian University of Technology, China, in 2012. Currently, he is an associate professor in the School of Business Administration, Huaqiao University, Quanzhou, China. His research interests include data mining and decision making.
