A dual approach to cluster discovery in point event data sets

doi:10.1016/j.compenvurbsys.2005.07.004

Computers, Environment and Urban Systems

Volume 31, Issue 1, January 2007, Pages 4-18

https://doi.org/10.1016/j.compenvurbsys.2005.07.004

Abstract

Spatial data mining seeks to discover meaningful patterns in data where a prime dimension of interest is geographical location. Consideration of a spatial dimension becomes important where data refer to specific locations and/or have significant spatial dependence that must be taken into account if meaningful patterns are to emerge. For point event data there are two main groups of approaches to identifying clusters. One stems from the statistical tradition of classification, which assigns point events to a spatial segmentation; a popular method is the k-means algorithm. The other broad approach searches for ‘hot spots’, which can be loosely defined as localised excesses of some incidence rate. Examples of this approach are the Geographical Analysis Machine (GAM) and kernel density estimation. This paper presents a novel variable resolution approach to ‘hot spot’ cluster discovery which acts to define spatial concentrations within the point event data. ‘Hot spot’ centroids are then used to establish additional distance variables and initial cluster centroids for a k-means classification that produces a segmentation, both spatially and by attribute. This dual approach is effective in quickly focusing on rational candidate solutions for the value of k and the choice of initial candidate centroids in the k-means clustering. This is demonstrated through the analysis of a business transactions database. The overall dual approach can be used effectively to explore clusters in very large point event data sets.

Introduction

Prior to the 1990s the spatial sciences, and the application of geographical information systems (GIS) in particular, suffered from a paucity of digital data sets. The 1990s were a period of transition into data-richness, a trend which continues to accelerate today. Digital spatial data sets have grown rapidly in coverage, volume of records and numbers of attributes per record (Gahegan, 2003; Miller and Han, 2001). This state change has come about as a result of:

• improved technology and wider use of GPS, remote sensing and digital photogrammetry for collecting data on topographic and other physical objects;
• the introduction of new approaches to obtaining lifestyle and preference data such as through loyalty cards;
• dramatic increases in computing power to process raw data coupled with falling unit costs of data storage and data processing;
• the advent of data warehousing technologies;
• more efficient means of accessing and delivering data on-line.

The technical advances in hardware, software and data have been so profound that they have fundamentally affected the range of problems studied and the methodologies used to study them (Macmillan, 1998). An exponential rise in the size of databases, their increasing complexity and the rate at which they accumulate on a daily basis have therefore led to an urgent need for techniques that can mine very large databases for the knowledge they contain. Consequently, an active area of research has focused on spatial data mining, which can be defined as techniques for the discovery of meaningful patterns from large data sets where a prime dimension of interest is geographical location. This paper focuses on clustering as a central aspect of spatial data mining and seeks to demonstrate the benefits of using ‘hot spot’ approaches to clustering in tandem with segmentation approaches to clustering. This is demonstrated using a case study analysis of a business transactions database. The following section discusses the theoretical perspectives and the dichotomy between the two different approaches to clustering. A form of ‘hot spot’ type clustering is then introduced and is subsequently used in the case study to guide a k-means classification of spatial and non-spatial attributes for a customer database. This forms the basis of the dual approach to cluster discovery alluded to in the title of the paper.

Section snippets

Cluster detection in point event data

Transactions databases, be they for business, crime or health, can be regarded as point event data sets if each record has a specific geographical identifier such that geocoding can be achieved at the resolution of an address or postcode. From a location perspective the point event is a binary occurrence – either it happened there or it did not. From a data perspective, the binary occurrence may have added dimensions of attributes that describe the nature or content of the transaction which may

A variable resolution approach to the analysis of point event data

A recursive decomposition of space into gradually smaller spatial units that are nevertheless space-filling is generally referred to as a hierarchical tessellation. Any such decomposition requires some predefined criteria such as decomposition ratio and minimum size of spatial units in order to guide and finally terminate the algorithm. Possibly the most studied hierarchical tessellation is the quadtree (Samet, 1984) in which an initial square region covering the entire study area is repeatedly
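As a rough illustration of the kind of recursive, space-filling decomposition described above, the following Python sketch builds a generic quadtree over a set of points. It is not the paper's own algorithm: the `Cell` class, the `max_points` stopping rule and the `min_size` threshold are assumptions chosen for illustration, standing in for the predefined criteria that guide and terminate the decomposition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Cell:
    x: float                 # lower-left corner of the square cell
    y: float
    size: float              # side length
    points: List[Point]      # points assumed to lie inside the cell
    children: List["Cell"] = field(default_factory=list)


def decompose(cell: Cell, max_points: int, min_size: float) -> None:
    """Recursively split a square cell into four equal quadrants.

    Hypothetical stopping rules: stop when the cell holds at most
    `max_points` points, or when its quadrants would be smaller than
    `min_size`.
    """
    half = cell.size / 2
    if len(cell.points) <= max_points or half < min_size:
        return
    # Assign every point to exactly one quadrant; points on the outer
    # edge are clamped into the last quadrant so none are lost.
    buckets = {(i, j): [] for i in (0, 1) for j in (0, 1)}
    for px, py in cell.points:
        i = min(int((px - cell.x) // half), 1)
        j = min(int((py - cell.y) // half), 1)
        buckets[(i, j)].append((px, py))
    for (i, j), pts in buckets.items():
        child = Cell(cell.x + i * half, cell.y + j * half, half, pts)
        cell.children.append(child)
        decompose(child, max_points, min_size)


def leaves(cell: Cell) -> List[Cell]:
    """Return the undivided cells, which together tile the root square."""
    if not cell.children:
        return [cell]
    return [leaf for child in cell.children for leaf in leaves(child)]


# Illustrative data: three nearby points and one isolated point.
root = Cell(0.0, 0.0, 16.0, [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (9.0, 9.0)])
decompose(root, max_points=2, min_size=1.0)
demo_leaves = leaves(root)
```

Dense areas end up covered by small cells and sparse areas by large ones, which is the variable resolution property the text refers to. Because each split replaces a cell with four quadrants that cover it exactly, the leaf cells remain space-filling at every stage.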

Geo-ProZones and k-means clustering in tandem: a case study

The case study focuses on an analysis of a business transactions database for one year. The database contained details of 2390 customers of whom 2361 (98.7%) could be geocoded to postcode level. This is a relatively small database by data mining standards but is sufficiently tractable to allow experimentation and tracking of special cases and checking results without being deluged by the data. The distribution of geocoded customers is given in Fig. 2(a). This particular business operates a

Conclusions

This paper has explored and demonstrated a dual approach in spatial data mining of point event data. The sequence has been:

• use a ‘hot spot’ style clustering of point events (in this instance, Geo-ProZones) treating each point as a binary event to suggest k number of classes centred on N initial candidate centroids (the ‘hot spots’) where k = N;
• create k new attributes for each point event being the Euclidean distance to each initial candidate centroid;
• bring the records spatially nearest to the N
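The sequence above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ‘hot spot’ centroids are simply supplied as input rather than derived with Geo-ProZones, the sample coordinates and the `spend` attribute are invented, and no attribute scaling is applied. The third step is truncated in this excerpt, so the sketch assumes, following the abstract, that the record nearest each ‘hot spot’ provides the initial centroid for the corresponding k-means cluster.

```python
import math


def distance_attributes(xy, centroids):
    """Step 2: Euclidean distance from each point to each candidate centroid."""
    return [[math.dist(p, c) for c in centroids] for p in xy]


def kmeans(features, seeds, n_iter=100):
    """Plain Lloyd-style k-means started from the given seed vectors."""
    centres = [list(s) for s in seeds]
    labels = []
    for _ in range(n_iter):
        # Assign each record to its nearest centre in feature space.
        labels = [
            min(range(len(centres)), key=lambda j: math.dist(f, centres[j]))
            for f in features
        ]
        # Recompute each centre as the mean of its members.
        new_centres = []
        for j, old in enumerate(centres):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(old)  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# Invented example data: six customers, one non-spatial attribute.
xy = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (10.0, 10.0), (10.3, 9.8), (9.7, 10.1)]
spend = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]

# Step 1 (assumed given): N hot spot centroids, so k = N = 2.
hot_spots = [(0.0, 0.0), (10.0, 10.0)]

# Step 2: k new distance attributes per record.
dist = distance_attributes(xy, hot_spots)
features = [[s] + d for s, d in zip(spend, dist)]

# Step 3 (assumed): seed each cluster with the record nearest its hot spot.
seed_idx = [
    min(range(len(xy)), key=lambda i: dist[i][j]) for j in range(len(hot_spots))
]
seeds = [features[i] for i in seed_idx]

labels, centres = kmeans(features, seeds)
```

Seeding from the ‘hot spots’ fixes both k and the starting centroids in advance, which is the benefit the dual approach claims over choosing them by trial and error. In practice the attributes and distance columns would probably need to be put on comparable scales before clustering.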

References (48)

P.J. Atkinson et al. Density and local attribute estimation of an infectious disease using MapInfo. Computers and Geosciences (2002)
V. Estivill-Castro et al. Argument free clustering for large spatial point-data sets via boundary extraction from Delaunay Diagram. Computers, Environment and Urban Systems (2002)
P.J. Halls et al. Dirichlet neighbours: Revisiting Dirichlet tessellation for neighbourhood analysis. Computers, Environment and Urban Systems (2001)
M.Y. Kiang. Extending the Kohonen self-organizing map networks for clustering analysis. Computational Statistics & Data Analysis (2001)
T. Koch et al. GIS approaches to the problem of disease clusters: A brief commentary. Social Science & Medicine (2001)
B.S. Rowlingson et al. Splancs: Spatial point pattern analysis code in S-Plus. Computers and Geosciences (1993)
M.S. Aldenderfer et al. Cluster analysis (1984)
M.P. Armstrong. Geography and computational science. Annals of the Association of American Geographers (2000)
A.J. Brimicombe. Small may be beautiful – but is simple sufficient? Geographical and Environmental Modelling (1999)
A.J. Brimicombe. Constructing and evaluating contextual indices using GIS: A case of primary school performance. Environment and Planning A (2000)
A.J. Brimicombe. GIS, environmental modelling and engineering (2003)
A.J. Brimicombe et al. A variable resolution, geocomputational approach to the analysis of point patterns. Hydrological Processes (2000)
P.J. Clark et al. Distance to nearest neighbour as a measure of spatial relations in populations. Ecology (1954)
A.D. Cliff et al. Spatial processes: Models and applications (1981)
H. Couclelis. Computation and space. Environment and Planning B (1998)
A.S. Fotheringham. Exploratory spatial data analysis and GIS. Environment and Planning A (1992)
A.S. Fotheringham. Trends in quantitative methods I: Stressing the local. Progress in Human Geography (1997)
A.S. Fotheringham. Trends in quantitative methods II: Stressing the computational. Progress in Human Geography (1998)
A.S. Fotheringham et al. Local forms of spatial analysis. Geographical Analysis (1999)
A.S. Fotheringham et al. Quantitative geography (2000)
M. Gahegan. Is inductive machine learning just another wild goose (or might it lay the golden egg)? International Journal of Geographical Information Science (2003)
A.C. Gatrell et al. Spatial point pattern analysis and its application in geographical epidemiology. Transactions of the Institute of British Geographers, NS (1996)
A.C. Gatrell et al. Spatial point process modelling in a geographical information system environment
J. Han et al. Spatial clustering methods in data mining
