One-dimensional partitioning for heterogeneous systems: Theory and practice

https://doi.org/10.1016/j.jpdc.2008.07.005

Abstract

We study the problem of one-dimensional partitioning of nonuniform workload arrays, with optimal load balancing for heterogeneous systems. We look at two cases: chain-on-chain partitioning, where the order of the processors is specified, and chain partitioning, where processor permutation is allowed. We present polynomial time algorithms to solve the chain-on-chain partitioning problem optimally, while we prove that the chain partitioning problem is NP-complete. Our empirical studies show that our proposed exact algorithms produce substantially better results than heuristics, while solution times remain comparable.

Introduction

In many applications of parallel computing, load balancing is achieved by mapping a possibly multi-dimensional computational domain down to a one-dimensional (1D) array, and then partitioning this array into parts with equal weights. Space filling curves are commonly used to map the higher dimensional domain to a 1D workload array to preserve locality and minimize communication overhead after partitioning [5], [6], [9], [15]. Similarly, processors can be mapped to a 1D array so that communication is relatively faster between close processors in this processor chain [10]. This eases mapping for computational domains and improves efficiency of applications. The load balancing problem for these applications can be modeled as the chain-on-chain partitioning (CCP) problem, where we map a chain of tasks onto a chain of processors. Formally, the objective of the CCP problem is to find a sequence of P−1 separators that divides a chain of N tasks with associated computational weights into P consecutive parts so as to minimize the maximum load among processors.

In our earlier work [17], we studied the CCP problem for homogenous systems, where all processors have identical computational power. We have surveyed the rich literature on this problem, proposed novel methods as well as improvements on existing methods, and studied how these algorithms can be implemented efficiently to be effective in practice. In this work, we investigate how these techniques can be generalized for heterogeneous systems, where processors have varying computational powers. Two distinct problems arise in partitioning chains for heterogeneous systems. The first problem is the CCP problem, where a chain of tasks is to be mapped onto a chain of processors, i.e., the pth task subchain in a partition is assigned to the pth processor. The second problem is the chain partitioning (CP) problem, where a chain of tasks is to be mapped onto a set, as opposed to a chain, of processors, i.e., processors can be permuted for subchain assignments. For brevity, the CCP problem for homogenous systems and heterogeneous systems will be referred to as the homogenous CCP problem and heterogeneous CCP problem, respectively. The CP problem refers to the chain partitioning problem for heterogeneous systems, since it has no counterpart for homogenous systems.

In this article, we show that the heterogeneous CCP problem can be solved in polynomial time, by enhancing the exact algorithms proposed for the solution of the homogenous CCP problem [17]. We present how these exact algorithms for homogenous systems can be enhanced for heterogeneous systems and implemented efficiently for runtime performance. We also present how the heuristics widely used for the solution of the homogenous CCP problem can be adapted for heterogeneous systems. We present the implementation details and pseudocode for the exact algorithms and heuristics for clarity and reproducibility. Our experiments with workload arrays coming from image-space-parallel volume rendering and row-parallel sparse matrix vector multiplication applications show that our proposed exact algorithms produce substantially better results than the heuristics, while the solution times remain comparable. On average, the load imbalance of optimal solutions is 4.9 and 8.7 times lower than that of the heuristics for 128-way partitionings of volume rendering and sparse matrix datasets, respectively. On average, computing an optimal solution takes less than 2.20 times as long as computing an approximation using heuristics for 128 processors, and thus the preprocessing time can be easily compensated by the improved efficiency of the subsequent computation even for a few iterations.

The CP problem, on the other hand, is NP-complete, as we prove in this paper. Our proof uses a pseudo-polynomial reduction from the 3-Partition problem, which is known to be NP-complete in the strong sense [7]. Our empirical studies showed that processor ordering has a very limited effect on the solution quality, and an optimal CCP solution on a random processor ordering serves as an effective CP heuristic.

The remainder of this paper is organized as follows. Table 1 summarizes important symbols used throughout the paper. Section 2 introduces the heterogeneous CCP problem. In Section 3, we summarize the solution methods for homogenous CCP. In Section 4, we discuss how solution methods for homogenous systems can be enhanced to solve the heterogeneous CCP problem. In Section 5, we discuss the CP problem and prove that it is NP-complete. We present the results of our empirical studies with the proposed methods in Section 6, and finally, we conclude with Section 7.

Section snippets

Chain-on-chain (CCP) problem for heterogeneous systems

In the heterogeneous CCP problem, a computational problem, which is decomposed into a chain $\mathcal{T} = \langle t_1, t_2, \ldots, t_N \rangle$ of $N$ tasks with associated positive computational weights $\mathcal{W} = \langle w_1, w_2, \ldots, w_N \rangle$, is to be mapped onto a processor chain $\mathcal{P} = \langle P_1, P_2, \ldots, P_P \rangle$ of $P$ processors with associated execution speeds $\mathcal{E} = \langle e_1, e_2, \ldots, e_P \rangle$. The execution time of task $t_i$ on processor $P_p$ is $w_i / e_p$. For clarity, we note that there are no precedence constraints among the tasks in the chain.
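As a small illustration of this cost model, the following Python sketch evaluates the bottleneck (the largest per-processor execution time) of one given partition. It is a hedged example rather than code from the paper: the function name `bottleneck` and the sample numbers are hypothetical, and the separator convention (s_0 = 0 up to s_P = N, with subchain p running from task s_{p-1}+1 to task s_p) is borrowed from the partition notation used later for the CP problem.

```python
from typing import Sequence


def bottleneck(weights: Sequence[float],
               speeds: Sequence[float],
               separators: Sequence[int]) -> float:
    """Return the maximum execution time over all processors.

    Assumed convention: separators = [s_0, s_1, ..., s_P] with s_0 = 0 and
    s_P = N; the p-th processor in the chain receives tasks s_{p-1}+1 .. s_p.
    """
    if len(separators) != len(speeds) + 1:
        raise ValueError("need exactly P + 1 separators for P processors")
    if separators[0] != 0 or separators[-1] != len(weights):
        raise ValueError("separators must start at 0 and end at N")

    worst = 0.0
    for p, speed in enumerate(speeds):
        # Python slices are 0-based and half-open, so this slice covers
        # exactly the tasks s_p + 1 .. s_{p+1} in the 1-based notation.
        load = sum(weights[separators[p]:separators[p + 1]])
        worst = max(worst, load / speed)
    return worst


# Hypothetical instance: five tasks, a slow processor followed by a fast one.
weights = [4, 2, 6, 3, 5]
speeds = [1, 2]
print(bottleneck(weights, speeds, [0, 2, 5]))  # loads 6/1 and 14/2
print(bottleneck(weights, speeds, [0, 3, 5]))  # loads 12/1 and 8/2
```

On this toy instance, moving the single separator from position 3 to position 2 shifts work toward the faster processor and lowers the bottleneck, which is the effect a heterogeneous CCP solver is meant to find automatically.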

A task subchain $\mathcal{T}_{i,j} = \langle t_i, t_{i+1}, \ldots, t_j \rangle$ is defined as a

CCP algorithms for homogenous systems

The homogenous CCP problem can be considered as a special case of the heterogeneous CCP problem, where the processors are assumed to have equal speed, i.e., $e_p = 1$ for all $p$. Here, we review the CCP algorithms for homogenous systems. A comprehensive review and presentation of homogenous CCP algorithms are available in [17].

Proposed CCP algorithms for heterogeneous systems

The algorithms we propose in this section extend the techniques for homogenous CCP to heterogeneous CCP. All algorithms discussed in this section require an initial prefix-sum operation on the task-weight array $\mathcal{W}$ for the efficiency of subsequent subchain-weight computations. The prefix-sum operation replaces the $i$th entry $W[i]$ with the sum of the first $i$ entries ($\sum_{h=1}^{i} w_h$) so that the computational weight $W_{i,j}$ of a task subchain $\mathcal{T}_{i,j}$ can be efficiently determined as $W[j] - W[i-1]$ in $O(1)$ time. In our
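The prefix-sum idea described in this section can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the names `prefix_sums`, `subchain_weight`, and `probe` are hypothetical, a leading zero entry W[0] = 0 is added so that W[j] - W[i-1] also works for i = 1, and `probe` is the standard greedy feasibility test from the CCP literature adapted to processor speeds, not necessarily the form used in the paper.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional, Sequence


def prefix_sums(weights: Sequence[float]) -> List[float]:
    """Return W with W[0] = 0 and W[i] = w_1 + ... + w_i (1-based tasks)."""
    return [0, *accumulate(weights)]


def subchain_weight(W: Sequence[float], i: int, j: int) -> float:
    """Weight of tasks t_i .. t_j (1-based, inclusive) in O(1) time."""
    return W[j] - W[i - 1]


def probe(W: Sequence[float], speeds: Sequence[float],
          B: float) -> Optional[List[int]]:
    """Greedy test: can the chain be partitioned with bottleneck at most B?

    Each processor, in chain order, takes as many consecutive tasks as it
    can finish within time B, i.e. a load of at most B * speed. Because the
    weights are positive, W is strictly increasing and each step is a
    binary search. Returns the separators on success, otherwise None.
    """
    n = len(W) - 1
    s = 0
    separators = [0]
    for speed in speeds:
        # Largest index j >= s with W[j] - W[s] <= B * speed.
        s = bisect_right(W, W[s] + B * speed, lo=s) - 1
        separators.append(s)
    return separators if s == n else None


W = prefix_sums([4, 2, 6, 3, 5])
print(subchain_weight(W, 2, 4))  # weight of tasks t_2 .. t_4
print(probe(W, [1, 2], 7))       # feasible: a list of separators
print(probe(W, [1, 2], 6))       # infeasible: None
```

A feasibility test of this kind is typically wrapped in a search over candidate bottleneck values; that outer search is omitted here to keep the example focused on the O(1) subchain-weight lookup.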

Chain Partitioning (CP) problem for heterogeneous systems

In this section, we study the problem of partitioning a chain of tasks onto a set of processors, as opposed to a chain of processors. The solution to this problem is not only separators on the task chain, but also processor-to-subchain assignments. Thus, we define a mapping $\mathcal{M}$ as a partition $\Pi = \langle s_0 = 0, s_1, \ldots, s_P = N \rangle$ of the given task chain $\mathcal{T} = \langle t_1, t_2, \ldots, t_N \rangle$ with $s_p \le s_{p+1}$ for $0 \le p < P$, and a permutation $\langle \pi_1, \pi_2, \ldots, \pi_P \rangle$ of the given set of $P$ processors $\mathcal{P} = \{P_1, P_2, \ldots, P_P\}$. According to this mapping, the pth task
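To make the role of the permutation concrete, the Python sketch below evaluates the cost of a mapping (a partition plus a processor permutation) and, for a fixed partition, tries every permutation by brute force. This is an illustrative assumption-laden example, not the paper's method: the names `mapping_cost` and `best_permutation` are hypothetical, processors are indexed from 0, and the exhaustive search is exponential in the number of processors, so it is only usable on toy inputs.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def mapping_cost(weights: Sequence[float], speeds: Sequence[float],
                 separators: Sequence[int], perm: Sequence[int]) -> float:
    """Bottleneck of a mapping.

    separators = [s_0 = 0, ..., s_P = N]; perm[p] is the 0-based index of
    the processor that executes the p-th task subchain.
    """
    return max(
        sum(weights[separators[p]:separators[p + 1]]) / speeds[perm[p]]
        for p in range(len(perm))
    )


def best_permutation(weights: Sequence[float], speeds: Sequence[float],
                     separators: Sequence[int]) -> Tuple[float, List[int]]:
    """Exhaustively pick the best processor order for a fixed partition."""
    best = min(
        permutations(range(len(speeds))),
        key=lambda perm: mapping_cost(weights, speeds, separators, perm),
    )
    return mapping_cost(weights, speeds, separators, best), list(best)


# Hypothetical instance: one heavy task followed by three light ones.
weights = [8, 1, 1, 2]
speeds = [1, 2]
separators = [0, 1, 4]
print(mapping_cost(weights, speeds, separators, [0, 1]))  # slow processor first
print(mapping_cost(weights, speeds, separators, [1, 0]))  # fast processor first
print(best_permutation(weights, speeds, separators))
```

In this example the same separators give different bottlenecks depending on which processor receives the heavy first subchain, which is exactly the extra freedom that distinguishes CP from CCP.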

Experimental setup

The 1D task arrays used in both CCP and CP experiments were derived from two different applications: image-space-parallel direct volume rendering and row-parallel sparse matrix vector multiplication.

Direct volume rendering experiments are performed on three curvilinear datasets from NASA Ames Research Center [13], namely Blunt Fin (blunt), Combustion Chamber (comb), and Oxygen Post (post). These datasets are processed using the tetrahedralization techniques described in [8], [18] to produce

Conclusions

We studied the problem of one-dimensional partitioning of nonuniform workload arrays with optimal load balancing for heterogeneous systems. We investigated two cases: chain-on-chain partitioning, where a chain of tasks is partitioned onto a chain of processors; and chain partitioning, where the task chain is partitioned onto a set of processors (i.e., permutation of the processors is allowed). We showed that chain-on-chain partitioning algorithms for homogenous systems can be revised to solve

Availability

The algorithms proposed in this work are implemented in Java and are publicly available at http://www.cs.bilkent.edu.tr/~tabak/hetccp/.

Acknowledgments

The first author was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the US Department of Energy under contract DE-AC03-76SF00098.

References (18)

  • K.D. Devine et al., New challenges in dynamic load balancing, Applied Numerical Mathematics (2005)
  • D.M. Nicol, Rectilinear partitioning of irregular data parallel computations, Journal of Parallel and Distributed Computing (1994)
  • A. Pinar et al., Fast optimal load balancing algorithms for 1D partitioning, Journal of Parallel and Distributed Computing (2004)
  • B.B. Cambazoglu et al., Hypergraph-partitioning-based remapping models for image-space-parallel direct volume rendering of unstructured grids, IEEE Transactions on Parallel and Distributed Systems (2007)
  • H.-A. Choi, B. Narahari, Algorithms for mapping and partitioning chain structured parallel computations, in:...
  • T.H. Cormen et al., Introduction to Algorithms (1989)
  • T. Davis, University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices, NA Digest,...
  • K.D. Devine, B. Hendrickson, E.G. Boman, M.M.S. John, C. Vaughan, Zoltan: A dynamic load-balancing library for parallel...
  • M.R. Garey et al., Computers and Intractability; A Guide to the Theory of NP-Completeness (1990)
There are more references available in the full text version of this article.

Ali Pinar is a member of the High Performance Computing Research Department at Lawrence Berkeley National Laboratory. His research is on combinatorial problems arising in algorithms and applications of scientific computing, with emphasis on parallel computing, sparse matrix computations, computational problems in electric power systems, data analysis, and interconnection network design. He received his PhD degree in 2001 in computer science from University of Illinois at Urbana-Champaign, with the option of computational science and engineering, and his B.S. and M.S. degrees in 1994 and 1996, respectively, in Computer Engineering from Bilkent University, Turkey.

He is a member of SIAM and its activity groups in Supercomputing, Computational Science and Engineering, and Optimization; IEEE Computer Society; and ACM. He has also been elected to serve as the secretary of the SIAM activity group in Supercomputing for the 2008–2009 term.

E. Kartal Tabak is a Ph.D. candidate at the Computer Engineering Department of Bilkent University. His research interests include parallel computing and algorithms, high performance Web Search Engines, high performance application servers, and software engineering.

Cevdet Aykanat received the BS and MS degrees from Middle East Technical University, Ankara, Turkey, both in electrical engineering, and the PhD degree from Ohio State University, Columbus, in electrical and computer engineering. He was a Fulbright scholar during his PhD studies. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989, he has been affiliated with the Department of Computer Engineering, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing, parallel scientific computing and its combinatorial aspects, parallel computer graphics applications, parallel data mining, graph and hypergraph partitioning, load balancing, neural network algorithms, high performance information retrieval systems, parallel and distributed web crawling, parallel and distributed databases, and grid computing. He has (co)authored about 50 technical papers published in academic journals indexed in SCI. He is the recipient of the 1995 Young Investigator Award of The Scientific and Technological Research Council of Turkey and 2007 Science Award bestowed by the METU Parlar Foundation. He is a member of the ACM and the IEEE Computer Society. He has been recently appointed as a member of IFIP Working Group 10.3 (Concurrent Systems).

This work is partially supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under projects EEEAG-105E065 and EEEAG-106E069.
