A fast and robust bulk-loading algorithm for indexing very large digital elevation datasets: I. Algorithm

https://doi.org/10.1016/j.cageo.2011.01.003

Abstract

Digital elevation models (DEMs) constitute a valuable source of data for a number of geoscience-related applications. The Shuttle Radar Topography Mission (SRTM) collected and released to the public what was, at that time, the world's largest DEM (composed of billions of points). The SRTM DEM is stored on the NASA repository as a well-organized collection of flat files. Retrieving the stored topographic information for a region of interest involves selecting the proper list of files, downloading them, filtering the data to the desired region, and processing the result according to user needs. To provide easier and faster access to these data and to improve their further analysis and processing, we have indexed the SRTM DEM by means of a spatial index based on the kd-tree data structure, called the Q-tree. This paper is the first in a two-part series that describes the method followed to build an index on such huge amounts of data while minimizing the number of insert operations. We demonstrate that our method can build a very efficient space-partitioning index, with good performance in both point and range queries on the spatial data. To the best of our knowledge, this is the only successful spatial indexing proposal in the literature that deals with such a huge volume of data.

Introduction

A digital elevation model (DEM) is a digital representation of terrain surface topography frequently used as a relevant data source for geographic information systems (GIS) and geoscience-related applications. In a DEM, the spatial framework represents a set of locations on a portion of the Earth's surface, and the data values are the heights of the terrain at those locations. DEMs are very useful in a wide range of application domains, including representation of geological phenomena (such as water-flow patterns or tsunami assessment), landscape analysis (e.g., in archeological prospecting, agricultural fields, or infrastructure), environmental impact studies, civil engineering, urban planning, or, as indicated in Worboys and Duckham (2004, p. 143), visibility analysis, routing, and shortest-path calculations. One of the most complete and high-quality DEMs comes from the Shuttle Radar Topography Mission (SRTM), which is accessible from the official National Aeronautics and Space Administration site (NASA, 2010). The SRTM DEM is distributed as a set of files, each representing an area of 1° of latitude by 1° of longitude. Every file contains the heights of the points inside the corresponding area, and each file name identifies the covered surface by including the latitude and longitude values for this area. Although this way of managing the data is efficient in space utilization, there are many interesting scenarios in which indexing DEM data using a vector model would be advantageous, such as integrating heterogeneous data sources and facilitating data mining on geodata. Vector data cope well with boundaries, represent geometric shapes precisely, provide more flexibility in adjusting the data to different scales, and allow resizing without losing resolution (see Worboys and Duckham, 2004, Chapter 1). In addition, the benefits of integrating heterogeneous geodata are well described in Butenuth et al. (2007), namely: (i) integrated analysis with prior information, (ii) reference to common geometry, (iii) mutual corrections and refinements, and (iv) mutual enrichment with semantic and geometric properties. With these premises in mind, indexing DEM data mapped to the vector model is an interesting solution when working with heterogeneous data integrated from diverse sources. For example, suppose we want to study the transit of invasive species across a given territory based on both terrain and weather conditions. We can store and index several satellite data sources (QuikSCAT, NOAA, ASTER, SRTM), together with other vector data of interest. Data with different content (points, lines, regions, etc.) can thus coexist in the same database, and a query with a single latitude–longitude selection criterion could retrieve all the heterogeneous information located in the selected area. Following this approach, indexing data under the vector model (including DEM data) allows (i) efficient management and retrieval of data, (ii) incremental updates of a portion of the data without having to rebuild the data entirely, and (iii) easy integration of geospatial data from different sources, content, location, and granularity.
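The file-per-degree layout described above can be illustrated with a minimal sketch that maps a coordinate, or a bounding box, to the tiles a user would have to fetch. It assumes the naming convention commonly used for SRTM tiles, in which a file is named after the southwest corner of its 1° cell (for example `N39W007.hgt`); the function names `srtm_tile_name` and `tiles_for_region` are hypothetical and do not come from the paper.

```python
import math
from typing import List


def srtm_tile_name(lat: float, lon: float) -> str:
    """Name of the 1-degree tile containing (lat, lon).

    Assumption: tiles are named after their southwest corner,
    e.g. N39W007.hgt covers latitudes 39..40 and longitudes -7..-6.
    """
    lat_sw = math.floor(lat)
    lon_sw = math.floor(lon)
    ns = "N" if lat_sw >= 0 else "S"
    ew = "E" if lon_sw >= 0 else "W"
    return f"{ns}{abs(lat_sw):02d}{ew}{abs(lon_sw):03d}.hgt"


def tiles_for_region(lat_min: float, lat_max: float,
                     lon_min: float, lon_max: float) -> List[str]:
    """List every tile that overlaps (or touches) a bounding box."""
    names = []
    for lat_sw in range(math.floor(lat_min), math.floor(lat_max) + 1):
        for lon_sw in range(math.floor(lon_min), math.floor(lon_max) + 1):
            # Any point inside the cell yields the cell's name.
            names.append(srtm_tile_name(lat_sw + 0.5, lon_sw + 0.5))
    return names


if __name__ == "__main__":
    print(srtm_tile_name(39.47, -6.37))
    print(tiles_for_region(39.2, 40.1, -7.5, -6.2))
```

Even a small region that straddles tile boundaries requires several files, each of which must then be filtered down to the area of interest. A spatial index is meant to remove exactly this file-selection and filtering step.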

Based on the benefits discussed in the previous paragraph, our proposal is to enhance the retrieval of the desired SRTM DEM data by means of an efficient indexing platform that provides the data requested by the user. The data will be indexed as vector data using an initial bulk-load process that minimizes the space needed for storage (subsequent geospatial data will be loaded using bulk insert techniques). With these goals in mind, we have developed a method for indexing the complete SRTM DEM that presents two remarkable strengths. On the one hand, given the huge volume of data involved, the index creation is particularly competitive compared with previously existing loading operations, owing to the design of an extremely efficient bulk-load algorithm (around 9% of the time required by the fastest Oracle 11g loading process), specifically adapted to the kd-tree structure of our index. On the other hand, the resulting spatial index is almost perfectly balanced and presents very good storage utilization (96% page utilization on average), which leads to optimal query performance (Ghanem et al., 2004, Papadopoulos and Manolopoulos, 2003). An important aspect is that our bulk-load algorithm is applicable to virtually any space-partitioning structure by minimally adapting the insertion process of the indexing method. To exemplify the proposed bulk-load algorithm, we use the Q-tree (Barrena, 1995, Jurado and Barrena, 2002, López et al., 2005), a space-partitioning index structure developed in our research group. It should be noted, however, that one of the techniques we propose, called the kd_mode, is intended specifically for index structures derived from the kd-tree family, because its main purpose is to keep these kinds of indices as balanced as possible. Moreover, deletions, updates, and bulk inserts (massive loads when the index is in production) are outside the scope of this study. To our knowledge, there is no comparable experience in the literature of indexing data for the whole Earth's surface through a spatial index, including the use of commercial database systems with spatial extensions.

The remainder of this paper is organized as follows: Section 2 presents the SRTM data. Section 3 gives a brief description of the Q-tree (i.e., the access method used to index the SRTM data). Section 4 introduces the details of the bulk-load algorithm developed on the Q-tree to efficiently store the entire SRTM dataset. Section 5 reviews related work, and, finally, Section 6 concludes.

Section snippets

SRTM dataset

In February 2000, SRTM, an international project led by the National Geospatial-Intelligence Agency (NGA, 2010) and NASA, produced the first-ever near-global dataset of land elevations. The data gathering, done by the Space Shuttle Endeavour, provided a global high-quality DEM, which meant an important step toward globalizing and homogenizing most topographical surface data on Earth. Before this mission, DEM data came from different sources, often derived from satellite imagery or

The Q-tree index

Like the K-D-B-tree (Robinson, 1981), the LSD-tree (Henrich et al., 1989), the hB-tree (Lomet and Salzberg, 1990), and the Bkd-tree (Procopiuc et al., 2003), the Q-tree, proposed in Barrena (1995), Jurado and Barrena (2002), and López et al. (2005), is basically a paginated kd-tree (Bentley, 1975), which adopts the index node splitting strategy from the hB-tree. Although important structural aspects make the Q-tree different from the hB-tree, from the space-partitioning point of view, each

The bulk-load algorithm

Indexing an amount of data as large as the SRTM DEM is a challenge in itself. A naive approach to this task would insert all the topographic points into the structure one by one, using the basic insert operation implemented for the index. Although the insertion of some data files progressed with this strategy, completing the entire process in this way would turn out to be impossible: the approach is not feasible in terms of time or manner.
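The general idea behind bulk loading, as opposed to inserting points one by one, can be sketched as follows. This is not the algorithm proposed in the paper and it is not the Q-tree: it is a generic top-down median-split bulk load of a bucketed (paginated) kd-tree, shown only to illustrate why building the structure from the whole dataset tends to give a balanced tree with full pages. `PAGE_CAPACITY`, `Node`, `bulk_load`, `leaf_pages`, and `range_query` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (longitude, latitude, height)
PAGE_CAPACITY = 4  # hypothetical number of points per data page


@dataclass
class Node:
    points: Optional[List[Point]] = None  # set only on data (leaf) pages
    axis: int = 0                         # 0 = longitude, 1 = latitude
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bulk_load(points: List[Point], depth: int = 0) -> Node:
    """Build the tree top-down by splitting at the median of each axis."""
    if len(points) <= PAGE_CAPACITY:
        return Node(points=list(points))
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        axis=axis,
        split=ordered[mid][axis],
        left=bulk_load(ordered[:mid], depth + 1),   # values <= split
        right=bulk_load(ordered[mid:], depth + 1),  # values >= split
    )


def leaf_pages(node: Node) -> List[List[Point]]:
    """Collect the data pages, e.g. to measure page utilization."""
    if node.points is not None:
        return [node.points]
    return leaf_pages(node.left) + leaf_pages(node.right)


def range_query(node: Node, lo: Tuple[float, float],
                hi: Tuple[float, float]) -> List[Point]:
    """Return the points with lo <= (lon, lat) <= hi, inclusive."""
    if node.points is not None:
        return [p for p in node.points
                if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]]
    found: List[Point] = []
    if lo[node.axis] <= node.split:
        found += range_query(node.left, lo, hi)
    if hi[node.axis] >= node.split:
        found += range_query(node.right, lo, hi)
    return found


if __name__ == "__main__":
    grid = [(float(x), float(y), float(x + y))
            for x in range(8) for y in range(8)]
    root = bulk_load(grid)
    pages = leaf_pages(root)
    used = sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)
    print(f"{len(pages)} pages, utilization {used:.0%}")
```

Because every split is computed from the full set of points below it, no page ever has to be split after the fact, which is the main cost of repeated single inserts. The sketch sorts in memory at every level, so it would not scale to billions of points as written; it only conveys the principle.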

If Fcg is the number of coarse-granularity SRTM files

Related work

With the proliferation of spatial databases, extensive research has provided a number of multidimensional data structures to enable practical and efficient spatial indexing (Gaede and Günther, 1998, Ghanem et al., 2004). These hierarchical structures are suitable to cope with the demanding task of indexing massive volumes of data. The main reason is the distribution according to two types of nodes: the index nodes to determine the search path, and the data nodes, at the end of the path, to

Conclusions

The SRTM represents one of the most valuable initiatives for progress in the geosciences by providing a global DEM of planet Earth. To use the SRTM DEM, applications must access the NASA repository to obtain the flat files that contain the data for the region of interest. The data processing is left entirely to the application. In this paper we have proposed a new and efficient way to index the complete SRTM DEM to facilitate further data processing tasks. Our algorithm is specifically designed

Acknowledgments

Useful discussions with Professor Ángel Felicísimo, from the Kraken research group at the University of Extremadura in Spain (Kraken, 2010), are gratefully acknowledged. His research, the quality of his advice, and his sound judgment have been a great aid to the development of our work. This work has been supported by the Spanish “Ministerio de Ciencia e Innovación” and the European Union (Project Grants TIN2005-05939 and TIN2008-03063).

References (35)

  • CGIAR, 2010. Consortium for Spatial Information GeoPortal. 〈http://srtm.csi.cgiar.org/〉 (accessed...
  • Farr, T.G., et al., 2007. The Shuttle Radar Topography Mission. Reviews of Geophysics.
  • Gaede, V., et al., 1998. Multidimensional access methods. ACM Computing Surveys.
  • Ghanem, T.M., et al. Bulk operations for space-partitioning trees.
  • GIM, 2010. Media Engineering Research Group. Universidad de Extremadura. Spain. 〈http://gim.unex.es/〉 Source code for...
  • Guttman, A. R-trees: a dynamic index structure for spatial searching.

  • Henrich, A., Six, H.-W., Widmayer, P., 1989. The LSD tree: spatial access to multidimensional point and...