DECODE: a new method for discovering clusters of different densities in spatial data

Data Mining and Knowledge Discovery

Abstract

When clusters with different densities and noise lie in a spatial point set, the major obstacle to classifying these data is the determination of the thresholds for classification, which may form a series of bins for allocating each point to different clusters. Much of the previous work has adopted a model-based approach, but is either incapable of estimating the thresholds in an automatic way, or limited to only two point processes, i.e. noise and clusters with the same density. In this paper, we present a new density-based clustering method (DECODE), in which a spatial data set is presumed to consist of different point processes, and clusters with different densities belong to different point processes. DECODE is based upon a reversible jump Markov Chain Monte Carlo (MCMC) strategy and is divided into three steps. The first step is to map each point in the data to its mth nearest distance, which is defined as the distance between a point and its mth nearest neighbor. In the second step, classification thresholds are determined via a reversible jump MCMC strategy. In the third step, clusters are formed by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the thresholds. Four experiments, on two simulated data sets and two seismic data sets, are used to evaluate the algorithm. Results on simulated data show that our approach is capable of discovering the clusters automatically. Results on seismic data suggest that the clustered earthquakes identified by DECODE either imply the epicenters of forthcoming strong earthquakes or indicate the areas with the most intensive seismicity, which is consistent with the tectonic states and estimated stress distribution in the associated areas.
The comparison between DECODE and other state-of-the-art methods, such as DBSCAN, OPTICS and Wavelet Cluster, illustrates the contribution of our approach: although DECODE can be computationally expensive, it is capable of identifying the number of point processes and simultaneously estimating the classification thresholds with little prior knowledge.
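
To make the first and third steps concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the classification thresholds are already available (in DECODE they come from the reversible jump MCMC step, which is not reproduced here). The function names, the brute-force neighbor search, the treatment of the last bin as noise, and the linking rule (two points in the same bin are connected when their separation does not exceed that bin's upper threshold) are all illustrative assumptions; the abstract does not specify them.

```python
import math
from bisect import bisect_right


def mth_nearest_distances(points, m):
    """Step 1: distance from each point to its m-th nearest neighbor (brute force)."""
    out = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(others[m - 1])
    return out


def assign_bins(distances, thresholds):
    """Bin index per point; `thresholds` must be sorted ascending. Bin 0 is the densest."""
    return [bisect_right(thresholds, d) for d in distances]


def connect_points(points, bins, thresholds):
    """Step 3 under an ASSUMED linking rule: union same-bin points whose
    separation is at most the bin's upper threshold. Points in the last
    bin (beyond every threshold) are treated as noise and labeled -1."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n_bins = len(thresholds)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            b = bins[i]
            if b == bins[j] and b < n_bins:
                if math.dist(points[i], points[j]) <= thresholds[b]:
                    parent[find(i)] = find(j)

    labels, ids = [], {}
    for i in range(len(points)):
        if bins[i] == n_bins:
            labels.append(-1)
        else:
            labels.append(ids.setdefault(find(i), len(ids)))
    return labels


# Toy data: a dense square, a sparser square, and one isolated point.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0),
       (50.0, 50.0)]
thresholds = [0.5, 2.0]  # hypothetical values; DECODE would estimate these
dists = mth_nearest_distances(pts, m=2)
labels = connect_points(pts, assign_bins(dists, thresholds), thresholds)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

With these hand-picked thresholds the two squares fall into separate bins and are recovered as two clusters, while the isolated point lands beyond the last threshold and is labeled as noise. The quadratic neighbor search is only suitable for small examples.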

Author information

Corresponding author

Correspondence to Chenghu Zhou.

Additional information

Responsible editor: Charu Aggarwal.

About this article

Cite this article

Pei, T., Jasra, A., Hand, D.J. et al. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min Knowl Disc 18, 337–369 (2009). https://doi.org/10.1007/s10618-008-0120-3

  • DOI: https://doi.org/10.1007/s10618-008-0120-3
