A dual approach to cluster discovery in point event data sets

https://doi.org/10.1016/j.compenvurbsys.2005.07.004

Abstract

Spatial data mining seeks to discover meaningful patterns in data where a prime dimension of interest is geographical location. Consideration of a spatial dimension becomes important where data refer to specific locations and/or have significant spatial dependence that must be taken into account if meaningful patterns are to emerge. For point event data there are two main groups of approaches to identifying clusters. One stems from the statistical tradition of classification, which assigns point events to a spatial segmentation. A popular method is the k-means algorithm. The other broad approach searches for ‘hot spots’, which can be loosely defined as a localised excess of some incidence rate. Examples of this approach are GAM and kernel density estimation. This paper presents a novel variable resolution approach to ‘hot spot’ cluster discovery which acts to define spatial concentrations within the point event data. ‘Hot spot’ centroids are then used to establish additional distance variables and initial cluster centroids for a k-means classification that produces a segmentation, both spatially and by attribute. This dual approach is effective in quickly focusing on rational candidate solutions for the value of k and the choice of initial candidate centroids in the k-means clustering. This is demonstrated through the analysis of a business transactions database. The overall dual approach can be used effectively to explore clusters in very large point event data sets.

Introduction

Prior to the 1990s the spatial sciences, and the application of geographical information systems (GIS) in particular, suffered from a paucity of digital data sets. The 1990s were a period of transition into data-richness, a trend which continues to accelerate today. Digital spatial data sets have grown rapidly in coverage, volume of records and numbers of attributes per record (Gahegan, 2003; Miller and Han, 2001). This state change has come about as a result of:

  • improved technology and wider use of GPS, remote sensing and digital photogrammetry for collecting data on topographic and other physical objects;

  • the introduction of new approaches to obtaining lifestyle and preference data such as through loyalty cards;

  • dramatic increases in computing power to process raw data coupled with falling unit costs of data storage and data processing;

  • the advent of data warehousing technologies;

  • more efficient means of accessing and delivering data on-line.

The technical advances in hardware, software and data have been so profound that they have fundamentally affected the range of problems studied and the methodologies used to study them (Macmillan, 1998). An exponential rise in the size of databases, their increasing complexity and the rate at which they can accumulate on a daily basis have therefore led to an urgent need for techniques that can mine very large databases for the knowledge they contain. Consequently, an active area of research has focused on spatial data mining, which can be defined as techniques for the discovery of meaningful patterns from large data sets where a prime dimension of interest is geographical location. This paper focuses on clustering as a central aspect of spatial data mining and seeks to demonstrate the benefits of using ‘hot spot’ approaches to clustering in tandem with segmentation approaches to clustering. This is demonstrated using a case study analysis of a business transactions database. The following section discusses the theoretical perspectives and the dichotomy between the two different approaches to clustering. A form of ‘hot spot’ type clustering is then introduced and is subsequently used in the case study to guide a k-means classification of spatial and non-spatial attributes for a customer database. This forms the basis of the dual approach to cluster discovery alluded to in the title of the paper.


Cluster detection in point event data

Transactions databases, be they for business, crime or health, can be regarded as point event data sets if each record has a specific geographical identifier such that geocoding can be achieved at the resolution of an address or postcode. From a location perspective the point event is a binary occurrence – either it happened there or it did not. From a data perspective, the binary occurrence may have added dimensions of attributes that describe the nature or content of the transaction which may

A variable resolution approach to the analysis of point event data

A recursive decomposition of space into gradually smaller spatial units that are nevertheless space-filling is generally referred to as a hierarchical tessellation. Any such decomposition requires some predefined criteria such as decomposition ratio and minimum size of spatial units in order to guide and finally terminate the algorithm. Possibly the most studied hierarchical tessellation is the quadtree (Samet, 1984) in which an initial square region covering the entire study area is repeatedly
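A recursive decomposition of this kind can be illustrated with a generic point quadtree. The sketch below is not the variable resolution method of the paper, whose details are not given in this excerpt; the function name `quadtree`, the stopping rule (`max_points`, `min_size`) and the toy coordinates are assumptions chosen only to show how predefined criteria guide and terminate the subdivision.

```python
def quadtree(points, x0, y0, size, max_points=2, min_size=1.0):
    """Recursively split a square cell into four quadrants.

    Returns the leaf cells as (x0, y0, size, points_in_cell) tuples.
    Cells are half-open: a point on the upper or right edge of the
    root square is not counted (assumption of this sketch).
    """
    inside = [(x, y) for x, y in points
              if x0 <= x < x0 + size and y0 <= y < y0 + size]
    half = size / 2
    # Stop when the cell is sparse enough or children would be too small.
    if len(inside) <= max_points or half < min_size:
        return [(x0, y0, size, inside)]
    leaves = []
    for dx in (0, half):
        for dy in (0, half):
            leaves += quadtree(inside, x0 + dx, y0 + dy, half,
                               max_points, min_size)
    return leaves


# Hypothetical point events inside an 8 x 8 study area.
events = [(1, 1), (1.5, 1.2), (2, 2.5), (6, 6), (7, 1)]
leaves = quadtree(events, 0.0, 0.0, 8.0)
for cell_x, cell_y, cell_size, members in leaves:
    print(cell_x, cell_y, cell_size, len(members))
```

The leaves remain space-filling: their areas sum to the area of the initial square, while denser parts of the study area end up covered by smaller cells.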

Geo-ProZones and k-means clustering in tandem: a case study

The case study focuses on an analysis of a business transactions database for one year. The database contained details of 2390 customers of whom 2361 (98.7%) could be geocoded to postcode level. This is a relatively small database by data mining standards but is sufficiently tractable to allow experimentation and tracking of special cases and checking results without being deluged by the data. The distribution of geocoded customers is given in Fig. 2(a). This particular business operates a

Conclusions

This paper has explored and demonstrated a dual approach in spatial data mining of point event data. The sequence has been:

  • use a ‘hot spot’ style clustering of point events (in this instance, Geo-ProZones) treating each point as a binary event to suggest k number of classes centred on N initial candidate centroids (the ‘hot spots’) where k = N;

  • create k new attributes for each point event being the Euclidean distance to each initial candidate centroid;

  • bring the records spatially nearest to the N
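The first two steps above, together with the use of ‘hot spot’ centroids as initial cluster centroids described in the abstract, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy coordinates and the choice to seed each cluster with the centroid's own distance vector are assumptions, and non-spatial attributes are left out for brevity.

```python
import math


def add_distance_attributes(points, centroids):
    """For each (x, y) point, return its Euclidean distance to every centroid."""
    return [[math.dist(p, c) for c in centroids] for p in points]


def kmeans_seeded(rows, seeds, max_iter=100):
    """Lloyd-style k-means that starts from the given seed vectors.

    k is fixed by the number of seeds, so no random initialisation is used.
    """
    centres = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centres)), key=lambda j: math.dist(r, centres[j]))
            for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centres)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical point events and N = 2 'hot spot' centroids, so k = N = 2.
events = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
hot_spots = [(0.3, 0.3), (10.3, 10.3)]

# Step 2: k new attributes per event (distance to each candidate centroid).
rows = add_distance_attributes(events, hot_spots)

# Assumed seeding: each centroid's own vector of distances to all centroids.
seeds = add_distance_attributes(hot_spots, hot_spots)

labels, centres = kmeans_seeded(rows, seeds)
print(labels)
```

Because both k and the starting centroids come from the ‘hot spot’ step, the k-means run needs no random restarts to choose them; in a fuller version the rows would also carry the non-spatial attributes of each record.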

References (48)

  • A.J. Brimicombe. GIS, environmental modelling and engineering (2003)

  • A.J. Brimicombe et al. A variable resolution, geocomputational approach to the analysis of point patterns. Hydrological Processes (2000)

  • P.J. Clark et al. Distance to nearest neighbour as a measure of spatial relations in populations. Ecology (1954)

  • A.D. Cliff et al. Spatial processes: Models and applications (1981)

  • H. Couclelis. Computation and space. Environment & Planning B (1998)

  • A.S. Fotheringham. Exploratory spatial data analysis and GIS. Environment and Planning A (1992)

  • A.S. Fotheringham. Trends in quantitative methods I: Stressing the local. Progress in Human Geography (1997)

  • A.S. Fotheringham. Trends in quantitative methods II: Stressing the computational. Progress in Human Geography (1998)

  • A.S. Fotheringham et al. Local forms of spatial analysis. Geographical Analysis (1999)

  • A.S. Fotheringham et al. Quantitative geography (2000)

  • M. Gahegan. Is inductive machine learning just another wild goose (or might it lay the golden egg)? International Journal of Geographical Information Science (2003)

  • A.C. Gatrell et al. Spatial point pattern analysis and its application in geographical epidemiology. Transactions of the Institute of British Geographers, NS (1996)

  • A.C. Gatrell et al. Spatial point process modelling in a geographical information system environment

  • J. Han et al. Spatial clustering methods in data mining
