Processing large-scale multi-dimensional data in parallel and distributed environments

doi:10.1016/S0167-8191(02)00097-2

Parallel Computing

Volume 28, Issue 5, May 2002, Pages 827-859

https://doi.org/10.1016/S0167-8191(02)00097-2

Abstract

Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly hindered by dataset size. The vast amount of data in scientific datasets makes it difficult both to access the data of interest efficiently and to manage the potentially heterogeneous system resources needed to process it. Subsetting and aggregation are common operations executed in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.

Introduction

There is a large body of research devoted to developing high-performance architectures and algorithms for efficient execution of large-scale scientific applications. Moreover, it is becoming increasingly efficient to use collections of high-performance machines for application execution, because of the availability of faster networks and tools for discovery, allocation, and management of distributed resources. As a result, long-running, large-scale simulations [20], [46], [56], [58] are producing unprecedented amounts of data. In addition, advanced sensors attached to instruments, such as earth-orbiting satellites and medical instruments [3], [62], are generating very large datasets that must be made available to a wider audience.

Looking at available technology, disk space has become plentiful and relatively inexpensive. Using off-the-shelf components, it is currently possible to build a disk-based storage cluster with about 1 Terabyte of storage space, consisting of six Pentium III PCs, each with two 80 GB EIDE disks, for about $10,000. The availability of such low-cost systems, built from networks of commodity computers and high-capacity disks, has greatly enhanced a scientist's ability to store large-scale scientific data. However, the primary goal of gathering data is better understanding of the scientific problem at hand, and data analysis is key to this understanding. The vast amount of data available in scientific datasets makes it an onerous task for a scientist both to efficiently access the data, and to manage the system resources required to process it.

A growing set of data-intensive applications query and analyze collections of very large multi-dimensional datasets. Examples of such applications include satellite data processing [24], [27], [62], full-scale water contamination studies and surface/subsurface petroleum reservoir simulations [44], [66], visualization and processing of digitized microscopy images [3], visualization of large-scale data [5], [8], [29], [42], [61], and data mining [4], [7], [34], [68]. Although the datasets used for analysis and the data products generated by applications that manipulate those datasets may differ in many ways, a close look at many data-intensive applications [17], [21], [31], [42], [44] reveals that there exist commonalities in their data access patterns and processing structures. Analysis requires extracting the data of interest from the dataset, and processing and transforming it into a new data product that can be more efficiently consumed by another program or analyzed by a human. Subsetting of data is often done through range queries, and aggregation (reduction) operations are commonly executed in the data processing step of a wide range of applications.
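The two-step pattern described above, subsetting by range query followed by an aggregation (reduction), can be illustrated with a minimal in-memory sketch. This is not code from the frameworks discussed in this paper: the names `Item`, `range_query`, and `aggregate` are hypothetical, and the sketch assumes a small dataset held in a Python list rather than data distributed across disks and processors.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical data element: a point in a multi-dimensional space plus a value."""
    coords: Tuple[float, ...]
    value: float


def in_range(coords: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> bool:
    """Return True if coords lies inside the axis-aligned box [lo, hi], bounds inclusive."""
    return all(l <= c <= h for c, l, h in zip(coords, lo, hi))


def range_query(items: Iterable[Item], lo: Sequence[float], hi: Sequence[float]) -> Iterator[Item]:
    """Subsetting step: yield only the items that fall inside the query box."""
    return (it for it in items if in_range(it.coords, lo, hi))


def aggregate(items: Iterable[Item], init: float, combine: Callable[[float, float], float]) -> float:
    """Reduction step: fold the selected values into a single accumulator."""
    acc = init
    for it in items:
        acc = combine(acc, it.value)
    return acc


# Illustrative two-dimensional dataset (made-up values).
dataset = [
    Item((0.0, 0.0), 1.0),
    Item((1.0, 2.0), 4.0),
    Item((2.0, 2.0), 3.0),
    Item((5.0, 5.0), 9.0),
]

# Select the items inside the box [0, 2] x [0, 2], then reduce them twice.
selected = list(range_query(dataset, lo=(0.0, 0.0), hi=(2.0, 2.0)))
total = aggregate(selected, 0.0, lambda a, v: a + v)
peak = aggregate(selected, float("-inf"), max)
```

Both `combine` functions used here (sum and max) are commutative and associative, so the selected items can be folded in any order. That property is what allows a reduction of this kind to be split across processors and the partial results combined afterwards.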

We argue that frameworks and methods can be developed that will provide common programming and runtime support for a wide range of applications that make use of large scientific datasets. In this paper, we present an overview of the methods and frameworks we have developed for efficient execution of applications that query and manipulate large, multi-dimensional datasets. The algorithms and runtime systems presented in this paper target architectures that range from tightly coupled distributed-memory parallel machines with attached disk farms to heterogeneous collections of high-performance machines and storage systems in a distributed computing environment.

Section snippets

Overview

In this section we briefly describe several data-intensive applications that have motivated the design and implementation of the algorithms and frameworks presented in this paper. We also discuss data access and processing patterns commonly observed in these applications.

Supporting reduction operations on distributed-memory parallel machines

The implementation of aggregation operations on a parallel machine requires distribution of data and computations among disks and processors to make efficient use of aggregate storage space and computing power, and carefully scheduling data retrieval, computation and network operations to keep all resources (i.e., disks, processor memory, network, and CPU) busy without overloading any of the resources. We have developed a framework, called the Active Data Repository (ADR) [21], [31], that

Supporting reduction operations in distributed, heterogeneous environments

In the previous section, we presented a framework and algorithms for efficient execution of data subsetting and reduction operations on tightly coupled parallel computer systems. With the help of faster networks and tools to discover and allocate distributed resources, it is becoming increasingly cost-effective to use collections of archival storage and computing systems in a distributed environment to store and manipulate large datasets. A networked collection of storage and computing

Related work

Reduction operations have long been recognized as an important source of parallelism for many scientific applications [26], [35], [36], [67]. Most techniques for optimizing parallel reductions have been developed for scenarios where data can fit into processor memory, and the main goal is to partition the iterations among processors to achieve good load balance with low induced interprocessor communication overhead. Brezany et al. [14] have extended the inspector–executor approach [51] for

Conclusions and future work

We have presented an overview of frameworks and methods we have developed to provide support for applications that analyze and explore large multi-dimensional scientific datasets. The ADR framework targets optimized execution of data-intensive applications on distributed-memory architectures with a disk farm. DataCutter and the filter-stream programming framework extend the work on tightly coupled, homogeneous systems to distributed, heterogeneous collections of computational and storage

Acknowledgements

We are grateful to the Albuquerque High Performance Computing Center for providing access to their Linux clusters and providing all the necessary support for some of the ADR experiments.

References (68)

S. Goil et al., PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining, Journal of Parallel and Distributed Computing (2001)
A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz, A. Sussman, Tuning the...
A. Acharya, M. Uysal, J. Saltz, Active disks: programming model, algorithms and evaluation, in: Proceedings of the...
A. Afework, M.D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, H....
R. Agrawal et al., Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering (December 1993)
J. Ahrens et al., Large-scale data visualization using parallel data streaming, IEEE Computer Graphics and Applications (July/August 2001)
K. Amiri, D. Petrou, G. Ganger, G. Gibson, Dynamic function placement in active storage clusters. Technical Report...
H. Andrade, T. Kurc, A. Sussman, J. Saltz, Decision tree construction for data mining on clusters of shared-memory...
C.L. Bajaj, V. Pascucci, D. Thompson, X.Y. Zhang, Parallel accelerated isocontouring for out-of-core visualization, in:...
N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An efficient and robust access method for points and...

M. Beynon et al., Performance impact of proxies in data intensive client-server applications
M.D. Beynon, R. Ferreira, T. Kurc, A. Sussman, J. Saltz, DataCutter: Middleware for filtering very large scientific...
M.D. Beynon et al., Optimizing execution of component-based applications using group instances
M.D. Beynon, A. Sussman, U. Catalyurek, T. Kurc, J. Saltz, Performance optimization for data intensive grid...
P. Brezany et al., Parallelization of irregular codes including out-of-core data and index arrays
A. Brown, D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, D. Patterson, ISTORE: Introspective storage for...
U. Catalyurek et al., Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems (1999)
U. Catalyurek et al., Improving the performance and functionality of the Virtual Microscope, Archives of Pathology and Laboratory Medicine (August 2001)
Common Component Architecture Forum....
C.F. Cerco, T. Cole, User's guide to the CE-QUAL-ICM three-dimensional eutrophication model, release version 1.0....
C. Chang et al., Infrastructure for building parallel database systems for multi-dimensional data
C. Chang, T. Kurc, A. Sussman, U. Catalyurek, J. Saltz, A hypergraph-based workload partitioning strategy for parallel...
C. Chang, T. Kurc, A. Sussman, J. Saltz, Optimizing retrieval and processing of multi-dimensional scientific datasets....
C. Chang et al., Titan: A high performance remote-sensing database
P.F. Corbett et al., The Vesta parallel file system, ACM Transactions on Computer Systems (1996)
R. Das et al., Communication optimizations for irregular scientific computations on distributed memory architectures, Journal of Parallel and Distributed Computing (1994)
H. Fallah-Adl et al., Fast algorithms for removing atmospheric effects from satellite images, IEEE Computational Science and Engineering (1996)
C. Faloutsos, P. Bhagwat, Declustering using fractals, in: Proceedings of the 2nd International Conference on Parallel...
R. Farias et al., Out-of-core rendering of large, unstructured grids, IEEE Computer Graphics and Applications (2001)
R. Ferreira et al., Compiling object-oriented data intensive applications
R. Ferreira et al., Object-relational queries into multi-dimensional databases with the Active Data Repository, Parallel Processing Letters (1999)
I. Foster et al., The GRID: blueprint for a new computing infrastructure, Morgan-Kaufmann (1999)
Global Grid Forum....


^☆: This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #ACI-9982087, the Office of Naval Research under Grant #N6600197C8534, Lawrence Livermore National Laboratory under Grant #B500288 (UC Subcontract #10184497), and the Department of Defence, Advanced Research Projects Agency, USAF, AFMC through Science Applications International Corporation under Grant #F30602-00-C-0009 (SAIC Subcontract #4400025559).
