A dual approach to cluster discovery in point event data sets
Introduction
Prior to the 1990s the spatial sciences, and the application of geographical information systems (GIS) in particular, suffered from a paucity of digital data sets. The 1990s were a period of transition into data-richness, a trend that continues to accelerate. Digital spatial data sets have grown rapidly in coverage, volume of records and number of attributes per record (Gahegan, 2003; Miller and Han, 2001). This state change has come about as a result of:
- improved technology and wider use of GPS, remote sensing and digital photogrammetry for collecting data on topographic and other physical objects;
- the introduction of new approaches to obtaining lifestyle and preference data such as through loyalty cards;
- dramatic increases in computing power to process raw data coupled with falling unit costs of data storage and data processing;
- the advent of data warehousing technologies;
- more efficient means of accessing and delivering data on-line.
The technical advances in hardware, software and data have been so profound that they have fundamentally affected the range of problems studied and the methodologies used to study them (Macmillan, 1998). The exponential rise in the size of databases, their increasing complexity and the rate at which they accumulate daily have therefore led to an urgent need for techniques that can mine very large databases for the knowledge they contain. Consequently, an active area of research has focused on spatial data mining, which can be defined as a set of techniques for discovering meaningful patterns in large data sets where a prime dimension of interest is geographical location. This paper focuses on clustering as a central aspect of spatial data mining and seeks to demonstrate the benefits of using ‘hot spot’ approaches to clustering in tandem with segmentation approaches, using a case study analysis of a business transactions database. The following section discusses the theoretical perspectives and the dichotomy between the two approaches to clustering. A form of ‘hot spot’ clustering is then introduced and subsequently used in the case study to guide a k-means classification of spatial and non-spatial attributes for a customer database. This forms the basis of the dual approach to cluster discovery alluded to in the title of the paper.
Section snippets
Cluster detection in point event data
Transactions databases, be they for business, crime or health, can be regarded as point event data sets if each record has a specific geographical identifier such that geocoding can be achieved at the resolution of an address or postcode. From a location perspective the point event is a binary occurrence – either it happened there or it did not. From a data perspective, the binary occurrence may have added dimensions of attributes that describe the nature or content of the transaction which may
A variable resolution approach to the analysis of point event data
A recursive decomposition of space into gradually smaller spatial units that are nevertheless space-filling is generally referred to as a hierarchical tessellation. Any such decomposition requires some predefined criteria such as decomposition ratio and minimum size of spatial units in order to guide and finally terminate the algorithm. Possibly the most studied hierarchical tessellation is the quadtree (Samet, 1984) in which an initial square region covering the entire study area is repeatedly
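To make the idea of a hierarchical tessellation concrete, the following is a minimal quadtree sketch, not the paper's own algorithm. It assumes a decomposition ratio of four (each square splits into four equal quadrants) and two stopping criteria: a minimum cell size, as mentioned above, and a hypothetical `max_points` threshold introduced here purely for illustration. The names `Cell`, `decompose` and `leaves` are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Cell:
    x: float                 # lower-left corner of the square cell
    y: float
    size: float              # side length of the square cell
    points: List[Point]      # point events falling inside the cell
    children: List["Cell"] = field(default_factory=list)


def decompose(cell: Cell, max_points: int, min_size: float) -> None:
    """Recursively split a square cell into four equal quadrants.

    Splitting stops when the cell holds at most `max_points` events
    (an assumed criterion) or when the quadrants would be smaller
    than `min_size`.
    """
    half = cell.size / 2
    if len(cell.points) <= max_points or half < min_size:
        return
    # Assign each point event to one of the four quadrants.
    buckets = {(i, j): [] for i in (0, 1) for j in (0, 1)}
    for px, py in cell.points:
        i = 1 if px >= cell.x + half else 0
        j = 1 if py >= cell.y + half else 0
        buckets[(i, j)].append((px, py))
    for (i, j), pts in buckets.items():
        child = Cell(cell.x + i * half, cell.y + j * half, half, pts)
        cell.children.append(child)
        decompose(child, max_points, min_size)


def leaves(cell: Cell) -> List[Cell]:
    """Return the undivided cells; together they fill the root square."""
    if not cell.children:
        return [cell]
    out: List[Cell] = []
    for child in cell.children:
        out.extend(leaves(child))
    return out


# Illustrative data: four events close together and one isolated event.
events = [(1.0, 1.0), (1.5, 1.2), (0.5, 0.7), (1.2, 0.4), (12.0, 13.0)]
root = Cell(0.0, 0.0, 16.0, events)
decompose(root, max_points=2, min_size=1.0)
```

The resolution varies with the data: cells around the dense group of events are subdivided repeatedly, while the sparsely occupied parts of the study area remain as large cells.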
Geo-ProZones and k-means clustering in tandem: a case study
The case study focuses on an analysis of a business transactions database for one year. The database contained details of 2390 customers of whom 2361 (98.7%) could be geocoded to postcode level. This is a relatively small database by data mining standards but is sufficiently tractable to allow experimentation and tracking of special cases and checking results without being deluged by the data. The distribution of geocoded customers is given in Fig. 2(a). This particular business operates a
Conclusions
This paper has explored and demonstrated a dual approach in spatial data mining of point event data. The sequence has been:
- use a ‘hot spot’ style clustering of point events (in this instance, Geo-ProZones) treating each point as a binary event to suggest k number of classes centred on N initial candidate centroids (the ‘hot spots’) where k = N;
- create k new attributes for each point event being the Euclidean distance to each initial candidate centroid;
- bring the records spatially nearest to the N
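The first two steps of this sequence can be sketched in code. The example below is a hedged illustration rather than the paper's implementation: the hot spot centroids are simply given as input (Geo-ProZones is not reproduced), `distance_attributes` builds the k distance attributes, and `kmeans` is a plain Lloyd's-style k-means with k = N. Seeding the clustering with the centroids' own distance rows is an assumption made here, since the remaining steps are not recoverable from the truncated text. All function names and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def distance_attributes(points: Sequence[Point],
                        centroids: Sequence[Point]) -> List[List[float]]:
    """For each point event, the Euclidean distance to every candidate centroid."""
    return [[math.dist(p, c) for c in centroids] for p in points]


def kmeans(rows: Sequence[Sequence[float]],
           seeds: Sequence[Sequence[float]],
           max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's k-means on attribute rows, started from the given seeds."""
    centres = [list(s) for s in seeds]
    labels = [-1] * len(rows)
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda j: math.dist(r, centres[j]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres will not move
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centres)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Illustrative data: N = 2 hot spot centroids and five point events.
hot_spots = [(0.0, 0.0), (10.0, 10.0)]
customers = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 11.0), (1.0, 1.0)]

attrs = distance_attributes(customers, hot_spots)   # k new attributes per record
seeds = distance_attributes(hot_spots, hot_spots)   # assumed seeding choice
labels, centres = kmeans(attrs, seeds)
```

In a fuller analysis, non-spatial attributes could be appended to each row before clustering, provided the seed rows are extended to the same dimension and the attributes are scaled comparably.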
References (48)
- et al. Density and local attribute estimation of an infectious disease using MapInfo. Computers and Geosciences (2002)
- et al. Argument free clustering for large spatial point-data sets via boundary extraction from Delaunay Diagram. Computers, Environment and Urban Systems (2002)
- et al. Dirichlet neighbours: Revisiting Dirichlet tessellation for neighbourhood analysis. Computers, Environment and Urban Systems (2001)
- Extending the Kohonen self-organizing map networks for clustering analysis. Computational Statistics & Data Analysis (2001)
- et al. GIS approaches to the problem of disease clusters: A brief commentary. Social Science & Medicine (2001)
- et al. Splancs: Spatial point pattern analysis code in S-Plus. Computers and Geosciences (1993)
- et al. Cluster analysis (1984)
- Geography and computational science. Annals of the Association of American Geographers (2000)
- Small may be beautiful – but is simple sufficient? Geographical and Environmental Modelling (1999)
- Constructing and evaluating contextual indices using GIS: A case of primary school performance. Environment and Planning A (2000)
- GIS, environmental modelling and engineering
- A variable resolution, geocomputational approach to the analysis of point patterns. Hydrological Processes
- Distance to nearest neighbour as a measure of spatial relations in populations. Ecology
- Spatial processes: Models and applications
- Computation and space. Environment and Planning B
- Exploratory spatial data analysis and GIS. Environment and Planning A
- Trends in quantitative methods I: Stressing the local. Progress in Human Geography
- Trends in quantitative methods II: Stressing the computational. Progress in Human Geography
- Local forms of spatial analysis. Geographical Analysis
- Quantitative geography
- Is inductive machine learning just another wild goose (or might it lay the golden egg)? International Journal of Geographical Information Science
- Spatial point pattern analysis and its application in geographical epidemiology. Transactions of the Institute of British Geographers, NS
- Spatial point process modelling in a geographical information system environment
- Spatial clustering methods in data mining