A spatial data pre-processing tool to improve the quality of the analysis and to reduce preparation duration

https://doi.org/10.1016/j.cie.2018.03.025

Highlights

  • The complexity and particularities of spatial data pre-processing are explained.

  • A tool that automates and improves the spatial data pre-processing is proposed.

  • Specifications, architecture and tools are presented to allow for reproducibility.

  • A case study based on real data supports the efficiency of the tool.

Abstract

Spatial data analysis allows for a better understanding of environmental effects on the performance of an organization’s activities. One of the first steps in such an analysis is to gather all of the spatialized data corresponding to the elements that might influence the activities. Then, a series of treatments must be applied to those datasets to make them ready for use in classical data mining tools.

Those pre-processing steps are complex and time-consuming tasks that may require advanced Geographic Information System (GIS) skills. Moreover, the choices involved in this process influence the quality of the analysis results.

To address those issues, we developed a tool that automates several steps of spatial data pre-processing. To allow for reproducibility, the specifications of our approach and the tools, architectures and techniques required are presented in detail.

To support the effectiveness of our approach, we present a case study that evaluates the processing time saved and the improvement in analysis quality.

Introduction

Understanding the effects of the spatial environment on the performance of an activity is a real advantage for many organizations in the public and private sectors. With the changing capabilities and costs of technologies, more and more organizations are accumulating data on their activities (such as sales metrics) that include spatial characteristics such as addresses or GPS coordinates. At the same time, an increasing amount of data is being made available on elements that potentially influence the performance of these organizations’ activities. For these reasons, significant research in various fields has sought to extract relevant information to understand what actually influences an activity. Mennis and Guo (2009) noted that spatial data mining is a growing research area. Spatial data mining, as defined by Koperski and Han (1995), consists of extracting implicit knowledge from spatial data. This research field is an extension of Knowledge Discovery from Databases (KDD), introduced by Fayyad, Piatetsky-Shapiro, and Smyth (1996). However, many existing data mining algorithms are not able to take advantage of the spatial aspect of the data. Thus, spatial components have to be prepared before they can be taken into consideration, and this preparation of spatial data is a complex and tedious task.
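To make the idea of "preparing spatial components" concrete, the following is a minimal sketch of one possible preparation step: turning the spatial relation between two point layers into ordinary numeric columns that a classical data mining tool can read. It is not the tool described in this article. The layer names (`stores`, `competitors`), the coordinates, the sales figures, and the choice of relations (distance to the nearest neighbour, count within a radius) are all hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def spatial_features(stores, competitors, radius_km):
    """Flatten spatial relations into one tabular row per store.

    Each output row keeps the non-spatial attributes and adds two derived
    columns: distance to the nearest competitor and the number of
    competitors within `radius_km`.
    """
    rows = []
    for store in stores:
        distances = [
            haversine_km(store["lat"], store["lon"], c["lat"], c["lon"])
            for c in competitors
        ]
        rows.append({
            "store_id": store["id"],
            "sales": store["sales"],
            "nearest_competitor_km": min(distances) if distances else None,
            "competitors_within_radius": sum(d <= radius_km for d in distances),
        })
    return rows


# Hypothetical input layers (illustrative values only, not data from the case study).
stores = [
    {"id": "S1", "lat": 46.8139, "lon": -71.2080, "sales": 120.0},
    {"id": "S2", "lat": 45.5017, "lon": -73.5673, "sales": 310.0},
]
competitors = [
    {"lat": 46.8000, "lon": -71.2500},
    {"lat": 45.5500, "lon": -73.6000},
    {"lat": 46.3432, "lon": -72.5477},
]

for row in spatial_features(stores, competitors, radius_km=10.0):
    print(row)
```

Calling `spatial_features` again with a larger `radius_km` changes the `competitors_within_radius` column for the same stores, which is a small illustration of how a pre-processing choice can alter the attributes that reach the analysis.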

The aim of this research is to present a tool that automates the pre-processing of the spatial data, removing the GIS skills requirement and allowing for improvement in the analysis quality and savings in processing time.

Section 2 reviews the literature on spatial data analysis and pre-processing to allow a better understanding of the problems that arise when spatial data are considered. Section 3 first presents the specificities of spatial data preparation, then focuses on how the choices made during pre-processing may influence the quality of subsequent analysis. Section 4 presents the specifications of our approach, along with the technical aspects of its implementation. Section 5 evaluates the improvements provided by our tool: a case study with real data shows the pre-processing tasks with and without the tool and how it performs. Finally, the limitations and perspectives of our research are discussed.

Section snippets

Spatial decision-making

Thirty years ago, Schmidt (1983) revealed that location decisions were made quickly by people without experience or knowledge of the issues involved. Decisions were made subjectively, with few requirements, and considered only a small portion of the existing options. At the same time, Herring and DeBinder (1981) argued that the use of computer tools could greatly improve the location decision-making process. A few years ago, MacEachren and Kraak (2001) noted that many problems in the

Necessity and complexity of pre-processing

To illustrate the problems associated with spatial data pre-processing, the following section focuses on the real case of a partner company for which we are developing an SDSS from a retail perspective. The company works in construction materials and distributes its products through third-party retailers. As mentioned by Cliquet et al. (2006) and Dubelaar et al. (2002), in the particular case of the retail sector, knowledge of the environment can be a major competitive advantage for improving performance. In

Specifications and technical aspects

Because the tool guides users through several steps, a diagram showing the sequence of steps is presented first, and each step is then detailed. Next, the prerequisites for implementing this solution are listed, a possible architecture is proposed, and optimizations to improve usability are presented.

Case study to evaluate improvements

First, to support the efficiency of the proposed pre-processing tool, the next subsections describe the steps to be taken with (Section 5.1) and without it (Section 5.2). Second, the quality improvement in further analysis is presented (Section 5.3).

Conclusions and perspectives

Numerous studies point to the complexity and time cost of spatial data pre-processing. A few studies have tried to address these issues by proposing methodological approaches or frameworks to facilitate pre-processing. Although these studies aim to simplify spatial data pre-processing, they do not remove the need for GIS knowledge, nor do they resolve the difficulty of choosing the spatial relations to be taken into account. From this observation, our research proposes a tool that

Acknowledgments

This research was supported by the FORAC research consortium and its partners, as well as financial support from NSERC.

References (50)

  • M. Vlachopoulou et al. Geographic information systems in warehouse site selection decisions. International Journal of Production Economics (2001)

  • T. Wanderer et al. Creating a spatial multi-criteria decision support system for energy related integrated environmental impact assessment. Environmental Impact Assessment Review (2015)

  • H. Alatrista Salas et al. A spatial-based KDD process to better understand the spatiotemporal phenomena

  • N. Andrienko et al. Exploratory analysis of spatial data using interactive maps and data mining. Cartography and Geographic Information Science (2001)

  • L. Anselin et al. GeoDa: An introduction to spatial data analysis. Geographical Analysis (2006)

  • A. Appice et al. Discovery of spatial association rules in geo-referenced census data: A relational mining approach. Intelligent Data Analysis (2003)

  • M. Armstrong et al. A knowledge-based approach for supporting locational decision making. Environment and Planning B: Planning and Design (1990)

  • V. Bogorny et al. Spatial data preparation for knowledge discovery

  • Caret. <https://cran.r-project.org/web/packages/caret/index.html>. Accessed 19 July...

  • K. Cios et al. The knowledge discovery process

  • E. Clementini et al. A small set of formal topological relationships suitable for end-user interaction

  • G. Cliquet et al. Management de la distribution (2006)

  • G. Daras et al. Development of business spatial analysis tools: Methodology and framework

  • P. Densham. Spatial decisions support systems

  • M.J. Egenhofer et al. Categorizing binary topological relations between regions, lines, and points in geographic databases
