Bandwidth optimal all-reduce algorithms for clusters of workstations

https://doi.org/10.1016/j.jpdc.2008.09.002

Abstract

We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound on the amount of data that must be communicated in order to complete this operation and propose a ring-based algorithm that only requires tree connectivity to achieve bandwidth optimality. Unlike the widely used butterfly-like all-reduce algorithm that incurs network contention in SMP/multi-core clusters, the proposed algorithm can achieve contention-free communication in almost all contemporary clusters, including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. We demonstrate that the proposed algorithm is more efficient than other algorithms on clusters with different nodal architectures and networking technologies when the data size is sufficiently large.

Introduction

The all-reduce operation combines values from all processes and distributes the results to all processes. It is commonly used in parallel computing. In the Message Passing Interface (MPI) standard [18], the routine for this operation is MPI_Allreduce.
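As a point of reference, the result of an all-reduce can be written down without any communication at all. The following is a minimal sketch of the semantics only, assuming each process holds a vector of equal length; the function name `allreduce_reference` and the list-of-lists representation are illustrative and are not part of MPI or of the paper.

```python
from functools import reduce
import operator


def allreduce_reference(local_values, op=operator.add):
    """Return what every process should hold after an all-reduce.

    local_values[p] is the vector held by (hypothetical) process p.
    The reduction is applied item by item across processes.
    """
    # Combine item i of every process into one value.
    combined = [reduce(op, column) for column in zip(*local_values)]
    # Every process receives its own copy of the combined vector.
    return [list(combined) for _ in local_values]


# Three processes, two items each.
result = allreduce_reference([[1, 2], [10, 20], [100, 200]])
```

Any distributed algorithm for the operation can be checked against this reference on small inputs.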

We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound on the amount of data that must be communicated in order to complete the all-reduce operation, and use the lower bound to establish the minimum time required for this operation. We propose a ring-based algorithm for this operation that achieves bandwidth optimality on tree topologies in that (1) each node sends the minimum amount of data required to complete this operation, and (2) all communications are contention free.
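To make the ring idea concrete, here is a small single-machine simulation of the commonly described two-phase ring all-reduce (a reduce-scatter around the ring followed by an all-gather around the ring). It is a sketch under stated assumptions, not the paper's routine: the exact schedule in Section 4 is not reproduced in this excerpt, the function name `ring_allreduce` is hypothetical, and the vector length X is assumed to be a multiple of the number of processes N. The `sent` counter records how many items each process sends.

```python
import operator


def ring_allreduce(data, op=operator.add):
    """Simulate a two-phase ring all-reduce; data[p] is process p's vector.

    Returns (final vectors, number of items each process sent).
    Assumes len(data[0]) is a multiple of len(data) and op is
    associative and commutative.
    """
    n = len(data)
    x = len(data[0])
    assert x % n == 0, "sketch assumes X divisible by N"
    c = x // n                          # items per chunk
    buf = [list(v) for v in data]       # working copy per process
    sent = [0] * n

    def chunk(k):
        k %= n
        return slice(k * c, (k + 1) * c)

    # Phase 1 (reduce-scatter): in step s, process p sends chunk p - s
    # to its right neighbour, which folds it into its own copy.
    for s in range(n - 1):
        msgs = [(p, buf[p][chunk(p - s)]) for p in range(n)]  # snapshot
        for p, payload in msgs:
            dst, sl = (p + 1) % n, chunk(p - s)
            buf[dst][sl] = [op(a, b) for a, b in zip(buf[dst][sl], payload)]
            sent[p] += c

    # Phase 2 (all-gather): process p now holds the finished chunk p + 1
    # and forwards finished chunks around the ring.
    for s in range(n - 1):
        msgs = [(p, buf[p][chunk(p + 1 - s)]) for p in range(n)]
        for p, payload in msgs:
            buf[(p + 1) % n][chunk(p + 1 - s)] = payload
            sent[p] += c

    return buf, sent


vectors, items_sent = ring_allreduce([[p * 10 + i for i in range(6)] for p in range(3)])
```

In this formulation every process sends 2(N - 1) chunks of X/N items each, and only ever to its right neighbour, at the cost of 2(N - 1) communication steps.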

Currently, the most widely used all-reduce scheme is the butterfly-like algorithm [22], [23], [27], where the all-reduce operation is realized with a recursive halving reduce-scatter followed by a recursive doubling all-gather. When the network can support the butterfly communication pattern without contention, this algorithm is optimal both in the latency term (using the minimum number of communication rounds needed) and in the bandwidth term (each node communicating the minimum amount of data required). The problem with the butterfly-like algorithm is that the butterfly communication pattern can cause network contention in many contemporary clusters, such as the widely deployed SMP/multi-core clusters. In contrast, our ring-based algorithm only requires a tree topology to be bandwidth optimal, and can achieve contention-free communication in almost all contemporary clusters including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. The ring-based algorithm also requires less working memory and can be applied to clusters with non-power-of-two numbers of nodes. One limitation of the proposed algorithm is that it is only optimal in the bandwidth term, but not the latency term: the number of communication rounds is proportional to the number of processes. Another issue in the ring-based algorithm is that the reduction results are computed with different “bracketing”, which may cause problems in the presence of rounding errors.
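For comparison, the butterfly pattern described above can be simulated in the same style. This is a hedged sketch of recursive halving followed by recursive doubling as they are usually presented, assuming the number of processes N is a power of two and the vector length is a multiple of N; the name `butterfly_allreduce` and the bookkeeping arrays `lo` and `hi` are illustrative only.

```python
import operator


def butterfly_allreduce(data, op=operator.add):
    """Simulate recursive-halving reduce-scatter + recursive-doubling all-gather.

    Returns (final vectors, number of communication rounds).
    Assumes a power-of-two process count and a commutative, associative op.
    """
    n = len(data)
    x = len(data[0])
    assert n & (n - 1) == 0 and x % n == 0, "sketch assumes N = 2^k and N | X"
    buf = [list(v) for v in data]
    lo = [0] * n                 # process p currently owns items [lo[p], hi[p])
    hi = [x] * n
    rounds = 0

    # Recursive halving: partners at distance d split their shared segment;
    # the lower rank keeps the lower half, the higher rank the upper half.
    d = n // 2
    while d >= 1:
        snapshot = [list(v) for v in buf]
        for p in range(n):
            q = p ^ d
            mid = (lo[p] + hi[p]) // 2
            new_lo, new_hi = (lo[p], mid) if p < q else (mid, hi[p])
            for i in range(new_lo, new_hi):
                buf[p][i] = op(snapshot[p][i], snapshot[q][i])
            lo[p], hi[p] = new_lo, new_hi
        rounds += 1
        d //= 2

    # Recursive doubling: partners exchange the finished segments they own.
    d = 1
    while d < n:
        snapshot = [list(v) for v in buf]
        bounds = list(zip(lo, hi))
        for p in range(n):
            ql, qh = bounds[p ^ d]
            buf[p][ql:qh] = snapshot[p ^ d][ql:qh]
            lo[p], hi[p] = min(lo[p], ql), max(hi[p], qh)
        rounds += 1
        d *= 2

    return buf, rounds


vectors, rounds = butterfly_allreduce([[p + i for i in range(8)] for p in range(4)])
```

The simulation finishes in 2 log2(N) rounds, but each round pairs processes at a different distance (p XOR d), which is the communication pattern that can contend for shared links when several processes sit behind the same network interface or switch uplink.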

We evaluate the proposed algorithm on various clusters of workstations, including high-end SMP/multi-core clusters with Myrinet and InfiniBand interconnects and low-end Ethernet switched clusters. The results show that the proposed algorithm significantly outperforms other algorithms when the data size is sufficiently large, which demonstrates the effectiveness of the proposed algorithm.

The rest of the paper is organized as follows: Section 2 introduces the all-reduce operation and the communication model that we use. In Section 3, we derive the theoretical lower bound on the communication time required for this operation. Section 4 presents the proposed bandwidth optimal all-reduce algorithm. Section 5 reports the results of our experiments. The related work is discussed in Section 6. Section 7 concludes the paper.

Section snippets

All-reduce operation

We will use a generic operator ⊕ to denote the reduce operator in the all-reduce operation. MPI_Allreduce requires the reduce operator to be associative, that is, (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c). Moreover, all built-in operations for MPI_Allreduce are also commutative, that is, a ⊕ b = b ⊕ a. We assume that the reduce operator is both associative and commutative in this paper.
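The associativity assumption holds exactly for integer addition but only approximately for floating-point addition, which is why the "bracketing" chosen by an algorithm can matter. The short sketch below illustrates this with two hypothetical helpers, `reduce_left` and `reduce_right`, that apply the same operator to the same values with different bracketing.

```python
import operator


def reduce_left(values, op=operator.add):
    """((v0 op v1) op v2) op ... : left-to-right bracketing."""
    total = values[0]
    for v in values[1:]:
        total = op(total, v)
    return total


def reduce_right(values, op=operator.add):
    """v0 op (v1 op (v2 op ...)) : right-to-left bracketing."""
    total = values[-1]
    for v in reversed(values[:-1]):
        total = op(v, total)
    return total


floats = [0.1, 0.2, 0.3]
ints = [1, 2, 3]

# Integer addition is associative, so both bracketings agree.
ints_agree = reduce_left(ints) == reduce_right(ints)
# IEEE 754 double addition rounds after every step, so they need not agree.
floats_agree = reduce_left(floats) == reduce_right(floats)
```

With these inputs `ints_agree` is true and `floats_agree` is false, so two all-reduce algorithms that are both correct can still return slightly different floating-point results.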

In terms of operating results, an all-reduce operation is equivalent to a reduction operation that reduces the results to one process,

The lower bounds

In a general-purpose all-reduce operation, data items are independent of each other, and the reductions on different items are independent of one another. Hence, the amount of data that must be communicated in order to complete an all-reduce operation on X items is equal to X times the amount for the single-item all-reduce operation. To obtain the lower bound for a general all-reduce operation on X items, we only need to find the lower bound for the operation on a single item.

In a one-item

A ring-based bandwidth optimal algorithm

We will describe a ring-based bandwidth optimal all-reduce algorithm for tree topologies, and show how such an algorithm can be applied to high-end clusters. The tree topology by itself is not particularly interesting. We develop bandwidth optimal algorithms for the tree topology only because a tree provides the minimum connectivity and most networks have tree embeddings: our bandwidth optimal all-reduce algorithm for trees can be applied to any topology with a tree embedding. Note that since our
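One way to see why tree connectivity can be enough for a contention-free ring is to order the machines by a depth-first traversal of the tree and check how often each directed link is used. The sketch below does this for a small hypothetical topology; it assumes machines are leaves, switches are internal nodes, and links are full duplex (each direction is counted separately). The paper's own construction is not included in this excerpt, so the DFS ordering and all names here (`children`, `tree_path`, `max_link_load`) are illustrative assumptions.

```python
from collections import Counter

# Hypothetical topology: switches s0..s2, machines n0..n4 (leaves).
children = {"s0": ["s1", "n0", "s2"], "s1": ["n1", "n2"], "s2": ["n3", "n4"]}
root = "s0"
parent = {root: None}
for node, kids in children.items():
    for kid in kids:
        parent[kid] = node


def dfs_machines(node):
    """Machines (leaves) in depth-first order."""
    if node not in children:
        return [node]
    return [m for kid in children[node] for m in dfs_machines(kid)]


def tree_path(a, b):
    """Directed links on the unique tree path from a to b."""
    def ancestors(v):
        chain = [v]
        while parent[chain[-1]] is not None:
            chain.append(parent[chain[-1]])
        return chain

    up, down = ancestors(a), ancestors(b)
    common = next(v for v in up if v in down)        # lowest common ancestor
    up = up[: up.index(common) + 1]
    down = down[: down.index(common) + 1]
    links = list(zip(up, up[1:]))                    # climb from a
    links += [(v, u) for u, v in reversed(list(zip(down, down[1:])))]  # descend to b
    return links


def max_link_load(ring):
    """Largest number of ring messages sharing one directed link."""
    load = Counter()
    for a, b in zip(ring, ring[1:] + ring[:1]):
        load.update(tree_path(a, b))
    return max(load.values())


dfs_ring = dfs_machines(root)                 # n1, n2, n0, n3, n4
scattered_ring = ["n1", "n3", "n2", "n4", "n0"]
dfs_load = max_link_load(dfs_ring)
scattered_load = max_link_load(scattered_ring)
```

For this topology the depth-first ring uses every directed link at most once, while the scattered ordering sends two messages over some inter-switch links at the same time.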

Experiments

Based on the algorithms described in the previous section, we implement all-reduce routines in two forms. For high-end SMP/multi-core clusters, we implement a stand-alone all-reduce routine based on Lemma 6. For clusters with a physical tree topology, we develop a routine generator that takes the tree topology information as input and automatically produces an all-reduce routine that uses the topology specific algorithm for the topology. The routine generator reads a topology description file

Related work

The all-reduce operation has been extensively studied. The one-item all-reduce operation has been studied under different names such as census function [2], global combine [3], [5], [6], [27], and gossip [15]. The lower bound for the communication time under various communication models has been established [2], [3], [5], [15]. In [15], it is shown that to complete a one-item all-reduce operation under the telephone model, at least 2N − 4 connections must be established when N > 4. In [2], [3], [5],

Conclusions

We investigate efficient implementations of the all-reduce operation with large data sizes under the assumption that the reduce operator is both associative and commutative. We derive a theoretical lower bound on the communication time of this operation and develop a bandwidth optimal all-reduce algorithm on tree topologies. This algorithm only requires tree connectivity to achieve bandwidth optimality and can be applied to contemporary clusters. We demonstrate the effectiveness of the proposed

Acknowledgments

This work is supported in part by National Science Foundation (NSF) grants: CCF-0342540, CCF-0541096, and CCF-0551555. Experiments are also performed on resources sponsored through an NSF Teragrid grant CCF-050010T.

Pitch Patarasuk received the B.S. degree in civil engineering from Chulalongkorn University, Thailand, in 1999, and the M.S. degree in computer science from Florida State University in 2004. He is currently a Ph.D. student in the Computer Science Department at Florida State University. His research interests include distributed systems, cluster computing, and communication optimizations.

References (28)

  • J. Bruck et al. Efficient global combine operations in multi-port message-passing systems. Parallel Processing Letters (1993)
  • J. Bruck et al. On the design and implementation of broadcast and global combine operations using the postal model. IEEE Transactions on Parallel and Distributed Systems (1996)
  • A. Faraj et al. Bandwidth efficient all-to-all broadcast on switched clusters. International Journal of Parallel Programming (2008)
  • A. Faraj et al. A message scheduling scheme for all-to-all personalized communication on Ethernet switched clusters. IEEE Transactions on Parallel and Distributed Systems (2007)

Xin Yuan received his B.S. and M.S. degrees in Computer Science from Shanghai Jiaotong University in 1989 and 1992, respectively. He obtained his Ph.D. degree in Computer Science from the University of Pittsburgh in 1998. He is currently an associate professor at the Department of Computer Science, Florida State University. His research interests include parallel and distributed systems, compilers, and networking. His email address is: [email protected].
