Processing large-scale multi-dimensional data in parallel and distributed environments☆
Introduction
There is a large body of research devoted to developing high-performance architectures and algorithms for efficient execution of large-scale scientific applications. Moreover, it is becoming increasingly efficient to use collections of high-performance machines for application execution, because of the availability of faster networks and tools for discovery, allocation, and management of distributed resources. As a result, long-running, large-scale simulations [20], [46], [56], [58] are producing unprecedented amounts of data. In addition, advanced sensors attached to instruments, such as earth-orbiting satellites and medical instruments [3], [62], are generating very large datasets that must be made available to a wider audience.
Looking at available technology, disk space has become plentiful and relatively inexpensive. Using off-the-shelf components, it is currently possible to build a disk-based storage cluster with about 1 Terabyte of storage space, consisting of six Pentium III PCs, each with two 80 GB EIDE disks, for about $10,000. The availability of such low-cost systems, built from networks of commodity computers and high-capacity disks, has greatly enhanced a scientist's ability to store large-scale scientific data. However, the primary goal of gathering data is better understanding of the scientific problem at hand, and data analysis is key to this understanding. The vast amount of data available in scientific datasets makes it an onerous task for a scientist both to efficiently access the data, and to manage the system resources required to process it.
A growing set of data-intensive applications query and analyze collections of very large multi-dimensional datasets. Examples of such applications include satellite data processing [24], [27], [62], full-scale water contamination studies and surface/subsurface petroleum reservoir simulations [44], [66], visualization and processing of digitized microscopy images [3], visualization of large-scale data [5], [8], [29], [42], [61], and data mining [4], [7], [34], [68]. Although the datasets used for analysis and the data products generated by applications that manipulate those datasets may differ in many ways, a close look at many data-intensive applications [17], [21], [31], [42], [44] reveals that there exist commonalities in their data access patterns and processing structures. Analysis requires extracting the data of interest from the dataset, and processing and transforming it into a new data product that can be more efficiently consumed by another program or analyzed by a human. Subsetting of data is often done through range queries, and aggregation (reduction) operations are commonly executed in the data processing step of a wide range of applications.
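The subset-then-aggregate pattern described above can be illustrated with a minimal sketch. It is not the implementation of any framework discussed in this paper: the names `Item`, `range_query`, `reduce_to_grid`, and `query_and_reduce` are hypothetical, the input is assumed to be a small in-memory list of 2-D points, the range query is a linear scan rather than an index lookup, and taking the maximum per output cell is just one example of an aggregation operation.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """One input element: a point in a multi-dimensional space plus a value."""
    coords: Tuple[float, ...]
    value: float


def in_range(coords: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> bool:
    """True if coords lies inside the axis-aligned box [lo, hi] (inclusive)."""
    return all(l <= c <= h for c, l, h in zip(coords, lo, hi))


def range_query(items: List[Item], lo: Sequence[float], hi: Sequence[float]) -> List[Item]:
    """Subsetting step: keep only the items that fall inside the query box.

    A linear scan is used here for brevity; a real system would use an index.
    """
    return [it for it in items if in_range(it.coords, lo, hi)]


def reduce_to_grid(items: List[Item], lo: Sequence[float],
                   cell_size: float) -> Dict[Tuple[int, ...], float]:
    """Aggregation step: map each item to an output grid cell and combine.

    The combine operation (max) is commutative and associative, so items
    can be processed in any order without changing the result.
    """
    acc: Dict[Tuple[int, ...], float] = {}
    for it in items:
        cell = tuple(int((c - l) // cell_size) for c, l in zip(it.coords, lo))
        acc[cell] = max(acc.get(cell, float("-inf")), it.value)
    return acc


def query_and_reduce(items: List[Item], lo: Sequence[float], hi: Sequence[float],
                     cell_size: float) -> Dict[Tuple[int, ...], float]:
    """Range query followed by reduction into a coarser output grid."""
    return reduce_to_grid(range_query(items, lo, hi), lo, cell_size)
```

Because the combine step is order-independent, the selected items could in principle be split among processors, reduced locally, and merged afterwards; the sketch deliberately leaves that partitioning out.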
We argue that frameworks and methods can be developed that will provide common programming and runtime support for a wide range of applications that make use of large scientific datasets. In this paper, we present an overview of the methods and frameworks we have developed for efficient execution of applications that query and manipulate large, multi-dimensional datasets. The algorithms and runtime systems presented in this paper target architectures that range from tightly coupled distributed-memory parallel machines with attached disk farms to heterogeneous collections of high-performance machines and storage systems in a distributed computing environment.
Section snippets
Overview
In this section we briefly describe several data-intensive applications that have motivated the design and implementation of the algorithms and frameworks presented in this paper. We also discuss data access and processing patterns commonly observed in these applications.
Supporting reduction operations on distributed-memory parallel machines
The implementation of aggregation operations on a parallel machine requires distribution of data and computations among disks and processors to make efficient use of aggregate storage space and computing power, and careful scheduling of data retrieval, computation, and network operations to keep all resources (i.e., disks, processor memory, network, and CPU) busy without overloading any of them. We have developed a framework, called the Active Data Repository (ADR) [21], [31], that
Supporting reduction operations in distributed, heterogeneous environments
In the previous section, we presented a framework and algorithms for efficient execution of data subsetting and reduction operations on tightly coupled parallel computer systems. With the help of faster networks and tools to discover and allocate distributed resources, it is becoming increasingly cost-effective to use collections of archival storage and computing systems in a distributed environment to store and manipulate large datasets. A networked collection of storage and computing
Related work
Reduction operations have long been recognized as an important source of parallelism for many scientific applications [26], [35], [36], [67]. Most techniques for optimizing parallel reductions have been developed for scenarios where data can fit into processor memory, and the main goal is to partition the iterations among processors to achieve good load balance with low induced interprocessor communication overhead. Brezany et al. [14] have extended the inspector–executor approach [51] for
Conclusions and future work
We have presented an overview of frameworks and methods we have developed to provide support for applications that analyze and explore large multi-dimensional scientific datasets. The ADR framework targets optimized execution of data-intensive applications on distributed-memory architectures with a disk farm. DataCutter and the filter-stream programming framework extend the work on tightly coupled, homogeneous systems to distributed, heterogeneous collections of computational and storage
Acknowledgements
We are grateful to the Albuquerque High Performance Computing Center for providing access to their Linux clusters and providing all the necessary support for some of the ADR experiments.
References (68)
- et al., PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining, Journal of Parallel and Distributed Computing (2001)
- A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz, A. Sussman, Tuning the...
- A. Acharya, M. Uysal, J. Saltz, Active disks: programming model, algorithms and evaluation, in: Proceedings of the...
- A. Afework, M.D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, H....
- et al., Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering (December 1993)
- et al., Large-scale data visualization using parallel data streaming, IEEE Computer Graphics and Applications (July/August 2001)
- K. Amiri, D. Petrou, G. Ganger, G. Gibson, Dynamic function placement in active storage clusters. Technical Report...
- H. Andrade, T. Kurc, A. Sussman, J. Saltz, Decision tree construction for data mining on clusters of shared-memory...
- C.L. Bajaj, V. Pascucci, D. Thompson, X.Y. Zhang, Parallel accelerated isocontouring for out-of-core visualization, in:...
- N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An efficient and robust access method for points and...
- Performance impact of proxies in data intensive client-server applications
- Optimizing execution of component-based applications using group instances
- Parallelization of irregular codes including out-of-core data and index arrays
- Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems
- Improving the performance and functionality of the Virtual Microscope, Archives of Pathology and Laboratory Medicine
- Infrastructure for building parallel database systems for multi-dimensional data
- Titan: A high performance remote-sensing database
- The Vesta parallel file system, ACM Transactions on Computer Systems
- Communication optimizations for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing
- Fast algorithms for removing atmospheric effects from satellite images, IEEE Computational Science and Engineering
- Out-of-core rendering of large, unstructured grids, IEEE Computer Graphics and Applications
- Compiling object-oriented data intensive applications
- Object-relational queries into multi-dimensional databases with the Active Data Repository, Parallel Processing Letters
- The GRID: blueprint for a new computing infrastructure, Morgan-Kaufmann
☆ This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #ACI-9982087, the Office of Naval Research under Grant #N6600197C8534, Lawrence Livermore National Laboratory under Grant #B500288 (UC Subcontract #10184497), and the Department of Defense, Advanced Research Projects Agency, USAF, AFMC through Science Applications International Corporation under Grant #F30602-00-C-0009 (SAIC Subcontract #4400025559).