A calculus for parallel computations over multidimensional dense arrays

https://doi.org/10.1016/j.cl.2006.07.005

Abstract

We present a calculus to formalize and give costs to parallel computations over multidimensional dense arrays. The calculus extends a simple distribution calculus (proposed in previous work) with computation and data collection. We consider an SPMD programming model in which process interaction can take place using point-to-point as well as collective operations, much in the style of MPI. We want to give a rigorous description of all stages of data parallel applications working over dense arrays: initial distribution (i.e., partition and replication) of arrays over a set of processors, parallel computation over distributed data, exchange of intermediate results, and final data gathering. In the paper, besides defining the calculus, we give it a formal semantics, prove equations between different combinations of operations, and show how to associate a cost with operation combinations. This last feature makes it possible to make quantitative, cost-driven choices between semantically equivalent implementation strategies.

Introduction

Large number-crunching problems over multidimensional dense arrays have always been an important application area for parallel computing and, over the years, many libraries and tools have been built to ease the parallelization of such problems (see for instance [1], [2], [3]). However, surprisingly little effort has been spent in trying to develop a formal methodology to assist the development of such applications in a systematic way. In particular, when designing the efficient implementation of an application working on multidimensional dense arrays, a programmer needs to compare different strategies for distributing/moving data across processors, to take into account many machine-specific details, and to reason about the relative performance of a large space of options. This is usually done in an ad hoc way, drawing complex data graphs on paper and trying to figure out the actual correctness of the strategies at hand. The problem is even worse when one tries to provide general and efficient implementations of high-level mechanisms describing in a compact way large families of computations over dense arrays (with an arbitrary number of dimensions), as we experienced when trying to incorporate the powerful P3L map skeleton [4] in the OcamlP3l library [5]. The actual efficient implementation of the general map requires us to compare different data distribution and recollection strategies, to evaluate their costs, and to prove their correctness in a formal way.

In this paper, we take a step towards the development of a formal framework for reasoning about dense array manipulation, with no bound on the number of dimensions. In particular, we assume an SPMD model of computation, in which process interaction can take place using point-to-point communication as well as a small set of collective operations (much in the style of MPI), and we define a calculus which allows us to formalize all the steps of an application, from the initial data distribution to the final collection of results. The calculus is simple enough to allow a readable semantics, yet powerful enough to let us compare realistic implementation strategies.

The contributions and the structure of the paper are as follows. Section 2 motivates our work through a simple example. Section 3 defines a model both for multidimensional dense arrays (with an unlimited number of dimensions) and for processors; in particular, we formalize the concept of communicator provided by MPI and use it to bound the scope of our collective array operations. Section 4 defines our calculus, targeted at the description of the parallel manipulation of multidimensional dense arrays over a set of processors; we give it a formal semantics in the denotational style, which helps us to prove equations between implementation strategies. Section 5 discusses two demonstration cost models for the calculus, one adopting the BSP style of interaction and one adopting the MPI asynchronous style; using these models we weigh different implementation strategies and choose between semantically equivalent options. More examples are provided in Section 6. Finally, related work is reviewed in Section 7 and Section 8 concludes.

A motivating example

We now introduce some of the problems to be addressed by means of a simple example, which will also be used as a running example throughout the rest of the paper.

Consider the problem of computing matrix multiplication in parallel: we want to compute C=B*A, where B and A are dense matrices of sizes n×k and k×m, respectively. Devising a good parallel algorithm for matrix multiplication involves devising a good strategy for breaking up the underlying data (the initial and final matrices) among processors, and
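One common strategy for this example can be sketched in plain Python, simulating the SPMD processes with a list of local values (all names here are ours, not the paper's notation): partition B into row blocks, replicate A on every processor, compute each row block of C locally, then gather the blocks.

```python
# Sketch, assuming a row-block distribution of B and full replication of A
# across p processors (one of several possible strategies).

def matmul(B, A):
    # naive sequential matrix multiply on lists of lists
    k, m = len(A), len(A[0])
    return [[sum(row[t] * A[t][j] for t in range(k)) for j in range(m)]
            for row in B]

def parallel_matmul(B, A, p):
    n = len(B)
    # distribute: processor i owns a contiguous block of rows of B
    blocks = [B[i * n // p:(i + 1) * n // p] for i in range(p)]
    # compute: each processor multiplies its block by the replicated A
    partials = [matmul(blk, A) for blk in blocks]  # simulated SPMD step
    # gather: concatenate the row blocks to rebuild C
    return [row for part in partials for row in part]

B = [[1, 2], [3, 4], [5, 6], [7, 8]]
A = [[1, 0], [0, 1]]
assert parallel_matmul(B, A, 2) == matmul(B, A)
```

Other distributions (e.g., replicating B and splitting A by columns, or blocking both dimensions) compute the same C at different communication costs; comparing such alternatives formally is exactly what the calculus is for.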

A model for dense arrays

In this section, we extend the model for array distributions presented in [7] adding intermediate data collection, collective computations over processors and result gathering. We first recall some basic definitions and then introduce the operations.
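To fix intuitions about the distribution operations being modeled, the following sketch (our own illustration, not the paper's definitions) shows a block partition of a 2D dense array over a p×q Cartesian grid of processors, together with the inverse gather:

```python
# Sketch, assuming a block distribution: processor (r, c) of a p x q
# Cartesian grid owns one rectangular block of the global array M.

def block_distribute(M, p, q):
    n, m = len(M), len(M[0])
    dist = {}
    for r in range(p):
        for c in range(q):
            rows = range(r * n // p, (r + 1) * n // p)
            cols = range(c * m // q, (c + 1) * m // q)
            dist[(r, c)] = [[M[i][j] for j in cols] for i in rows]
    return dist

def gather(dist, p, q):
    # inverse operation: reassemble the global array from the blocks
    out = []
    for r in range(p):
        for i in range(len(dist[(r, 0)])):
            out.append([x for c in range(q) for x in dist[(r, c)][i]])
    return out

M = [[i * 4 + j for j in range(4)] for i in range(4)]
assert gather(block_distribute(M, 2, 2), 2, 2) == M
```

Replication (giving several processors a copy of the same block) fits the same picture; in the calculus, distributions like these are first-class objects whose compositions can be compared and costed.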

A calculus for dense arrays

Using the array operations introduced in Section 3 as a semantic model, we can now introduce a simple calculus to describe formally the evolution over time of distributed dense arrays. For the sake of brevity, we do not give the formal semantics of the full language (notably, of its standard functional part), but only of the special features modeling distributed data.

Definition 21 Dense array calculus

The language of the dense array calculus is composed of the following syntactic categories:

Index domains: idom:aCartesian

Cost models

We now want to associate a cost with array operations, so we first fix some details of the execution model. We assume a set of processors arranged as a Cartesian communicator and communicating via message passing. We consider two message-passing styles: bulk synchronous (as in BSP [13]) and asynchronous (as in MPI). Since the interaction style influences the cost model, we discuss the two cases in separate subsections. In particular, Section 5.1 discusses the costs in BSP and
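As a concrete reminder of how BSP-style costing works (this is the standard BSP superstep formula, not the paper's specific model; the parameter values below are illustrative only): a superstep costs w + h·g + l, where w is the maximum local work, h the maximum number of words any processor sends or receives, g the per-word communication gap, and l the barrier latency.

```python
# Standard BSP cost: a superstep costs max local work + g * h-relation + l,
# and a program costs the sum of its supersteps. g and l are machine
# constants; the numbers used below are illustrative, not measured.

def superstep_cost(work, h_in, h_out, g, l):
    # work, h_in, h_out: per-processor values for one superstep
    w = max(work)                    # slowest local computation
    h = max(max(h_in), max(h_out))   # h-relation of the superstep
    return w + h * g + l

def program_cost(supersteps, g, l):
    return sum(superstep_cost(w, hi, ho, g, l) for (w, hi, ho) in supersteps)

# Two supersteps on 4 processors, with g = 2 and l = 50:
steps = [([100, 90, 95, 80], [4, 4, 4, 4], [4, 4, 4, 4]),
         ([60, 60, 60, 60], [0, 0, 0, 0], [16, 0, 0, 0])]
print(program_cost(steps, g=2, l=50))  # prints 300
```

Weighing two semantically equivalent strategies then amounts to comparing two such sums, which is what the cost models of this section make precise for the calculus operations.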

More examples: broadcast and scatter

In this section, we apply our calculus to study how a broadcast on a communicator C can be decomposed into a sequence of broadcasts on 'smaller' (i.e., lower-cardinality) communicators forming a partition of C. A similar result is then discussed for the scatter operation. For both decompositions we provide the semantic equivalence rules and compare performance costs according to the BSP model.
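The kind of equivalence studied here can be illustrated with a small simulation (our own sketch, not the paper's calculus): broadcasting a root's value over C delivers the same final state as broadcasting among the leaders of a partition of C and then broadcasting independently inside each sub-communicator.

```python
# Simulated SPMD memory: mem[i] is processor i's local value (None = absent).

def bcast(mem, group, root):
    # one broadcast of the root's value inside a group of processor ids
    for i in group:
        mem[i] = mem[root]

def two_stage_bcast(mem, groups, root):
    # Stage 1: broadcast among the leaders (first member) of each group;
    # the leader of the root's own group acts as the source.
    leaders = [g[0] for g in groups]
    src = next(g[0] for g in groups if root in g)
    mem[src] = mem[root]          # move the value to the leader if needed
    bcast(mem, leaders, src)
    # Stage 2: independent broadcasts inside each sub-communicator
    for g in groups:
        bcast(mem, g, g[0])

p = 8
flat = [None] * p; flat[3] = "v"
bcast(flat, list(range(p)), 3)

staged = [None] * p; staged[3] = "v"
two_stage_bcast(staged, [[0, 1, 2, 3], [4, 5, 6, 7]], 3)
assert flat == staged == ["v"] * p
```

The interesting question, addressed by the BSP cost comparison, is when the two-stage version is cheaper, since the two stages involve smaller h-relations but pay two supersteps instead of one.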

Related work

The goal of our calculus is to give experienced program developers a sound basis for their design choices when implementing applications or support software working on multidimensional dense arrays, with an arbitrary number of dimensions. Thus, we do not consider automatic optimization of programs working on dense arrays, nor do we want to derive such programs automatically from some sequential/functional description.

This is why, we believe, we did not really find a suitable calculus ready for us in the

Conclusions and future work

We have introduced a calculus to describe parallel computations over dense arrays with an arbitrary number of dimensions. The calculus comes with a denotational semantics, which allows us to formally prove the equivalence of different strategies for the same operation. We also show how to associate costs with such strategies, using both BSP and MPI/PVM styles of interaction, allowing us to choose the most efficient one. This result is quite satisfactory for us, as it sets in a formal, yet simple,

References (28)

  • C. Rodríguez et al.

    A new parallel model for the analysis of asynchronous algorithms

    Parallel Computing

    (2000)
  • J. Choi et al.

    ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers

  • Casanova H, Dongarra J. Netsolve: a network server for solving computational science problems. Technical report,...
  • S. Sekiguchi et al.

    Ninf: network based information library for globally high performance computing

  • Danelutto M, Pasqualetti F, Pelagatti S. Skeletons for data parallelism in P3L. In: Proceedings of EURO-PAR ’97,...
  • Danelutto M, Di Cosmo R, Leroy X, Pelagatti S. Parallel functional programming with skeletons: the OcamlP3l...
  • M. Quinn

    Parallel computing: theory and practice

    (1994)
  • R. Di Cosmo et al.

    A calculus for dense array distributions

    Parallel Processing Letters

    (2003)
  • H. Xi et al.

    Dependent types in practical programming

  • Bird R. Lectures in constructive functional programming. In: Constructive methods in computer science. NATO ASI, vol....
  • S. Gorlatch et al.

    Optimization rules for programming with collective operations

  • Mosses PD. Handbook of theoretical computer science, vol. B. Cambridge, MA: MIT Press; 1991. [Denotational...
  • Li Z. Efficient implementation of map skeleton for the OcamlP3l system. DEA report, Université PARIS VII; July...
  • D.B. Skillicorn et al.

    Questions and answers about BSP

    Scientific Programming

    (1997)