Abstract
Exascale computing aspires to meet the increasing demands of large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; hence, it is important not only to develop well-designed software, but also to make it aware of the hardware architecture and able to efficiently exploit its power. Currently, many diverse applications, such as those part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of heFFTe (Highly Efficient FFT for Exascale), a library targeting the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than \(40{\times }\) speedup with respect to local kernels from state-of-the-art CPU libraries, and over \(2{\times }\) speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V100 GPUs.
1 Introduction
Considered one of the top 10 algorithms of the 20th century, the Fast Fourier Transform (FFT) is widely used by applications in science and engineering. Such is the case for applications targeting exascale, e.g. LAMMPS (EXAALT-ECP) [14], and diverse software ranging from particle applications [21] and molecular dynamics, e.g. HACC [7], to applications in machine learning, e.g. [16]. For all these applications, it is critical to have access to a heterogeneous, fast and scalable parallel FFT library with an implementation that can take advantage of novel hardware components and efficiently exploit their benefits.
Highly efficient implementations of single-node FFT computation have been available for a long time. One of the most widely used libraries is FFTW [10], which has been tuned to perform optimally on several architectures. Vendor libraries have also been highly optimized, such as MKL (Intel) [13], ESSL (IBM) [8], clFFT (AMD) [1] and CUFFT (NVIDIA) [18]. Novel libraries are also being developed to further optimize single-node FFT computation, e.g. FFTX [9] and Spiral [22]. Most of these libraries have been extended to distributed-memory versions, some by the original developers and others by different authors.
1.1 Related Work
In the realm of distributed-CPU libraries, FFTW supports MPI via slab decomposition; however, its scalability is limited to a small number of nodes. P3DFFT [19] extends FFTW functionality and supports both pencil and slab decompositions. Large-scale applications have built their own FFT libraries, such as FFTMPI [20] (built into LAMMPS [14]) and SWFFT [23] (built into HACC [7]); these libraries are currently used by several molecular-dynamics applications.
Concerning distributed-GPU libraries, the slab approach introduced in [17] is one of the first heterogeneous codes for large FFT computation on GPUs. Its optimization targets a small number of nodes and focuses on reducing the cost of tensor transposition (known to be the bottleneck) by exploiting InfiniBand interconnects through the IBVerbs library, which makes it not portable. Further scalability improvements were presented in the FFTE library [26], which supports pencil decomposition and includes several optimizations, although with limited features and limited improvements on communication; FFTE also relies on the commercial PGI compiler, which may limit its usage. Finally, one of the most recent libraries is AccFFT [11], which overlaps computation with blocking collective communication while reducing the PCIe overhead; it provides good (sublinear) scalability results for large real-to-complex transforms on NVIDIA K20 GPUs.
Even though the fast development of GPUs has enabled great speedups of local computations, the cost of communication between CPUs/GPUs in large-scale computations remains the bottleneck, a major challenge supercomputing has faced over the last decade [6]. Large parallel FFTs are well known to be communication bound; experiments and models have shown that for large node counts the impact of communication must be efficiently managed to properly target exascale systems [5, 15].
In this context, we introduce heFFTe (pronounced “hefty”), which provides very good (linear) scalability for large node counts. It is open source and consists of C++ and CUDA kernels with (CUDA-aware) MPI and OpenMP interfaces for communication. It has a user-friendly interface and does not require any commercial compiler; wrappers to interface with C, Fortran and Python are available. It is publicly available in [2] and documented in [24, 27,28,29]. Its main objective is to become the standard for large FFT computations on the upcoming exascale systems. Figure 1 shows how heFFTe is positioned in the ECP software stack, together with some of its target exascale applications (gray boxes).
This paper is organized as follows: Sect. 2 describes the classical multi-dimensional FFT algorithm and its implementation phases within heFFTe, and then presents heFFTe’s main features and functionalities. Next, Sect. 3 presents a multi-node communication model for parallel FFTs and an approach to reduce its computational complexity for a small number of nodes. Section 4 presents numerical experiments on the Summit supercomputer and evaluates the impact of multi-GPU communication on performance. Finally, Sect. 5 concludes the paper.
2 Methodology and Algorithmic Design
Multidimensional FFTs can be performed by a sequence of low-dimensional FFTs (see e.g. [12]). Typical approaches used by parallel libraries are the pencil and slab decompositions. Algorithm 1 presents the pencil decomposition approach, which computes 3D FFTs by means of three 1D FFTs. This approach is schematically shown in Fig. 2. On the other hand, slab decomposition relies on computing sets of 2D and 1D FFTs.
Figure 2 schematically shows the steps of a parallel 3D FFT computation using a 3D partition of processors. The top part of this figure presents the pencil methodology, as described in Algorithm 1, in which \(\hat{N}_i\) denotes the output data obtained from applying a 1D FFT of size \(N_i\) along the i-th direction. This approach can be summarized as follows: the input data, of size \(N_0 \times N_1 \times N_2\), is initially distributed on a grid of processors, \(P_{\text {i}0} \times P_{\text {i}1} \times P_{\text {i}2}\), in what is known as brick decomposition. Then, a reshape (transposition) puts the data into pencils along the first direction, where the first set of 1D FFTs is performed. These two steps are repeated for the second and third directions. Observe that intermediate reshaped data is handled in new processor grids, which must be appropriately created to ensure load balancing; for simplicity, a single \(Q_0 \times Q_1\) grid is used in Algorithm 1. Finally, a last data reshape takes pencils along the third direction into the output brick decomposition.
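The sweep structure of Algorithm 1 can be illustrated with a small single-process sketch, in which naive \(O(n^2)\) 1D DFTs stand in for the optimized kernels heFFTe actually calls and the parallel reshapes are replaced by strided indexing:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT of one line (stride-accessed) of a flattened tensor;
// stands in for the optimized 1D kernels (FFTW, cuFFT, ...) heFFTe calls.
void dft_line(std::vector<cd>& a, int offset, int stride, int n) {
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> out(n);
    for (int k = 0; k < n; ++k) {
        cd s(0.0, 0.0);
        for (int j = 0; j < n; ++j)
            s += a[offset + j * stride] * std::polar(1.0, -two_pi * k * j / n);
        out[k] = s;
    }
    for (int k = 0; k < n; ++k) a[offset + k * stride] = out[k];
}

// 3D DFT computed as three sweeps of 1D DFTs over pencils, one sweep per
// direction, mirroring Algorithm 1 (minus the parallel data reshapes).
std::vector<cd> fft3d_pencils(std::vector<cd> a, int n0, int n1, int n2) {
    for (int i = 0; i < n0; ++i)            // pencils along direction 2
        for (int j = 0; j < n1; ++j)
            dft_line(a, (i * n1 + j) * n2, 1, n2);
    for (int i = 0; i < n0; ++i)            // pencils along direction 1
        for (int k = 0; k < n2; ++k)
            dft_line(a, i * n1 * n2 + k, n2, n1);
    for (int j = 0; j < n1; ++j)            // pencils along direction 0
        for (int k = 0; k < n2; ++k)
            dft_line(a, j * n2 + k, n1 * n2, n0);
    return a;
}
```

For a delta-function input the output is constant, and for a constant input only the zero-frequency term is nonzero, which makes the sweep structure easy to check on tiny sizes.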
Several applications provide input data already in pencil distribution along the first direction and require the output as pencils along the third direction; in this case, only two data reshapes are required, which is the default for the FFTE [26] and AccFFT [11] libraries. heFFTe, on the other hand, treats input and output shapes with high flexibility, generalizing the features of modern libraries, with a friendly interface as presented in Sect. 2.3.
Finally, the bottom part of Fig. 2 shows the slab approach, which saves one step of data reshape by performing 2D FFTs; this has a considerable impact on performance for a small number of nodes [25].
2.1 Kernels Implementation
Two main sets of kernels intervene in a parallel FFT computation:

1. Computation of low-dimensional FFTs, which can be obtained from optimized single-node FFT libraries, as those described in Sect. 1.
2. Data reshape, which essentially consists of a tensor transposition and takes a great part of the computation time.
To compute low-dimensional FFTs, heFFTe supports several open-source and vendor libraries for single-node FFTs, as those described in Sect. 1, and provides templates to select the type and precision of the data. For direct integration into applications, heFFTe provides example wrappers and templates that help users easily link it with their libraries.
Data reshape is built from two sets of routines. The first consists of packing and unpacking kernels, which respectively manage the data to be sent and received among processors; this set of kernels generally accounts for less than 10% of the reshaping time. Several options for packing and unpacking data are available in heFFTe, with an option to tune and find the best one for a given problem on a given architecture. The second set corresponds to communication kernels; heFFTe supports binary and collective communications, as presented in Table 1.
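A minimal CPU sketch of such pack/unpack kernels (the function names and the flattened row-major layout are illustrative, not heFFTe's actual interface) copies a sub-brick of the local tensor into a contiguous send buffer and back:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack the sub-brick [lo, hi] (inclusive) of a flattened n0 x n1 x n2
// row-major array into a contiguous buffer, as done before sending.
template <typename T>
std::vector<T> pack_brick(const std::vector<T>& a, int n1, int n2,
                          int lo0, int lo1, int lo2,
                          int hi0, int hi1, int hi2) {
    std::vector<T> buf;
    buf.reserve(std::size_t(hi0 - lo0 + 1) * (hi1 - lo1 + 1) * (hi2 - lo2 + 1));
    for (int i = lo0; i <= hi0; ++i)
        for (int j = lo1; j <= hi1; ++j)
            for (int k = lo2; k <= hi2; ++k)
                buf.push_back(a[(i * n1 + j) * n2 + k]);
    return buf;
}

// Unpack a received contiguous buffer back into the sub-brick [lo, hi].
template <typename T>
void unpack_brick(const std::vector<T>& buf, std::vector<T>& a, int n1, int n2,
                  int lo0, int lo1, int lo2, int hi0, int hi1, int hi2) {
    std::size_t p = 0;
    for (int i = lo0; i <= hi0; ++i)
        for (int j = lo1; j <= hi1; ++j)
            for (int k = lo2; k <= hi2; ++k)
                a[(i * n1 + j) * n2 + k] = buf[p++];
}
```

heFFTe provides tuned CPU and GPU variants of this operation, including versions that permute the storage order during the unpack.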
The main performance gain of heFFTe over standard libraries comes from its kernel optimizations (\(> 40{\times }\) speedup w.r.t. CPU libraries, c.f. Fig. 6) and from a novel asynchronous management of packing and communication, shown in Fig. 3: while most libraries support only the classical approach of packing all data before communicating, the routine heFFTe_alltoallv, introduced in [4], overlaps packing/unpacking with MPI communication.
2.2 Communication Design and Optimization
Parallel FFT libraries typically handle communication by moving data structures in the shape of pencils, bricks, or slabs. For each of these options the total amount of data communicated is the same; hence, decreasing the number of messages between processors means increasing the size of each message. On the other hand, for modern hardware architectures, it is well known that latency and bandwidth do not improve as quickly as arithmetic computation power [6]. Therefore, it is important to choose the appropriate communication scheme. For instance, reshaping brick to pencil data requires \(O(P^{1/3})\) messages, which can be verified by overlapping both grids. Analogously, the number of messages for reshaping pencil to pencil is \(O(P^{1/2})\), brick to slab \(O(P^{2/3})\), and slab to pencil \(O(P)\).
Choosing the right communication scheme depends strongly on the problem size and hardware features. heFFTe supports standard MPI_Alltoallv within subgroups, which generally performs better than point-to-point communication. However, optimized all-to-all routines for heterogeneous clusters are still not available (e.g. in the NVIDIA Collective Communications Library [3]), even though, as can be seen in Fig. 6, improvements to all-to-all communication are critical. For this reason, we developed a routine called heFFTe_alltoallv [4], which includes several all-to-all communication kernels and can be tuned to select the best one for a given architecture. This routine aims to provide better management of multi-rail communication. The asynchronous approach of heFFTe_alltoallv proved efficient for up to 32 nodes; the multi-rail management, although promising, negatively impacts performance as the node count increases, degrading the potential benefit of the multi-rail optimization [4]. Work on communication-avoiding frameworks targeting large node counts on heterogeneous clusters is ongoing.
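As a toy single-process illustration of this overlap (not heFFTe's actual implementation: the chunk "packing" and in-memory "transfer" below are stand-ins for the real pack kernels and nonblocking MPI sends), the transfer of one message can proceed while the next message is being packed:

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Pipelined all-to-all sketch: pack the message for rank r+1 while the
// message for rank r is "in flight" (here, an asynchronous in-memory copy;
// in a real multi-rank code this would be MPI_Isend/MPI_Irecv).
std::vector<std::vector<int>>
pipelined_alltoall(const std::vector<int>& local, int nranks) {
    std::size_t chunk = local.size() / nranks;
    std::vector<std::vector<int>> delivered(nranks);
    std::future<void> inflight;                  // previous transfer, if any
    for (int r = 0; r < nranks; ++r) {
        // "pack" the message for rank r (reversal stands in for real packing)
        std::vector<int> msg(local.begin() + r * chunk,
                             local.begin() + (r + 1) * chunk);
        std::reverse(msg.begin(), msg.end());
        if (inflight.valid()) inflight.wait();   // drain previous transfer
        inflight = std::async(std::launch::async,
            [msg = std::move(msg), &delivered, r]() mutable {
                delivered[r] = std::move(msg);   // simulated transfer
            });
    }
    if (inflight.valid()) inflight.wait();
    return delivered;
}
```

The point of the pattern is that packing cost is hidden behind communication, which is what heFFTe_alltoallv exploits at scale with GPU pack kernels and CUDA-aware MPI.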
2.3 Application Programming Interface (API) for heFFTe
heFFTe’s software design is built on C and C++, and provides wrappers to be used in Fortran and Python. The API aims to be user-friendly and portable among several architectures, allowing users to easily link it to their applications. The API follows styles from FFTW3 and CUFFT libraries, adding novel features, such as templates for multitype and multiprecision data.
Define FFT Parameters. Distributed FFTs require the data to be split over a processor grid. In 3D, input/output arrays are typically defined by the low and high corners of a brick, e.g. (\(i_{\text {lo}},i_{\text {hi}}\)), as shown in Fig. 2; heFFTe allows this definition over any MPI sub-communicator, fft_comm, which has to be provided by the user together with the data definition.
FFT Plan Definition. Once the data and processor grids are locally defined, the user can create an FFT plan by simply providing the following parameters:

- dim: problem dimension, e.g., dim = 3
- N: array size, e.g., N = [nx,ny,nz]
- permute: permutation storage of the output array
Next, we show how a heFFTe plan is created; memory requirements are returned to the user as a workspace array.
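The original listing is not reproduced in this version of the text. As an illustration only, the following sketch uses heFFTe's publicly documented C++ interface (heffte::box3d, heffte::fft3d), which postdates this paper and may differ from the API shown in the original listing:

```cpp
#include <complex>
#include <vector>
#include <mpi.h>
#include "heffte.h"

// Illustrative plan creation and reuse with heFFTe's later public C++ API.
void run_fft(MPI_Comm fft_comm, int nx, int ny, int nz) {
    // Brick owned by this rank: low and high grid indices (i_lo, i_hi).
    heffte::box3d<> inbox ({0, 0, 0}, {nx - 1, ny - 1, nz - 1});
    heffte::box3d<> outbox({0, 0, 0}, {nx - 1, ny - 1, nz - 1});

    // Plan: backend (FFTW here; cuFFT on GPU builds), bricks, communicator.
    heffte::fft3d<heffte::backend::fftw> fft(inbox, outbox, fft_comm);

    std::vector<std::complex<double>> input(fft.size_inbox());
    std::vector<std::complex<double>> output(fft.size_outbox());

    fft.forward(input.data(), output.data());   // the plan is reusable
}
```

Here size_inbox() and size_outbox() report the workspace requirements mentioned above, and the same plan object serves repeated transforms.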
Note that a single plan can be used for several FFT computations, which is typical for several applications where grids are fixed.
FFT Execution. One of heFFTe’s most important kernels is the one in charge of the FFT computation. This kernel has the same syntax for any type of data and its usage follows APIs from CUFFT and FFTW3.
A similar execution function, heffte_execute_r2c, is available for real-to-complex (R2C) transforms.
3 Multi-node Communication Model
To analyze the bottleneck of exascale FFT computation, communication models can be deduced and experimentally verified for different types of computer clusters [5]. These models can be built for specific frameworks, such as pencil and slab data exchanges [25], or oriented toward the hardware architecture [11].
In this section, we propose an inter-node communication model for large FFTs. We focus on inter-node effects since fast interconnection, e.g. NVLINK, is typically available intra-node, and properly scheduled intra-node communications can be overlapped with the inter-node communications. Table 2 summarizes the parameters used in the communication model.
To create a communication model, we analyze the computational intensity (\(\varphi \)) in Flops/Byte. For the FFT, the number of flops is \(5N\log _2(N)\) and the volume of data moved at each reshape is \(\alpha N\); then, for the total FFT computation using P nodes with r reshapes, we get
\[ \varphi = \frac{5N\log _2(N)}{r\,\alpha N} = \frac{5\log _2(N)}{r\alpha }, \]
and the peak performance (in GFlops) is defined as
\[ \text {Peak} = \varphi \times B \times P. \]
For the case of the Summit supercomputer, we have a node interconnection of \(B=25\) GB/s; considering \(r=4\) (c.f. Fig. 2) and double-precision complex data (i.e. \(\alpha = 16\)), we obtain
\[ \text {Peak} = \frac{5\log _2(N)}{64} \times 25 \times P \approx 1.95\, \log _2(N)\, P \text { GFlops}. \]
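Under the assumptions stated above (\(5N\log _2(N)\) flops, r reshapes moving \(\alpha N\) bytes each, aggregate bandwidth \(B \times P\)), the model can be evaluated with a few lines; the function names below are our own:

```cpp
#include <cassert>
#include <cmath>

// Computational intensity of a 3D FFT under the model of Sect. 3:
// 5 N log2(N) flops versus r reshapes of alpha*N bytes each (N cancels).
double fft_intensity(double n, double r, double alpha) {
    return 5.0 * std::log2(n) / (r * alpha);   // Flops/Byte
}

// Network-bound roofline: intensity times the aggregate bandwidth of
// P nodes, each with injection bandwidth b_gb GB/s; result in GFlop/s.
double fft_roofline_gflops(double n, double p, double b_gb,
                           double r, double alpha) {
    return fft_intensity(n, r, alpha) * b_gb * p;
}
```

For the Summit values (B = 25 GB/s, r = 4, \(\alpha = 16\)) and N = 1024³, this yields an intensity of about 2.34 Flops/Byte, i.e. roughly 58.6 GFlop/s of network-bound peak per node.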
Figure 8 shows heFFTe’s performance for a typical FFT of size \(N=1024^3\) and compares it to the roofline peak for increasing numbers of nodes, reaching about 90% of the peak value.
4 Numerical Experiments
In this section we present numerical experiments on the Summit supercomputer, which has 4,608 nodes, each with 2 IBM Power9 CPUs and 6 NVIDIA V100 GPUs. For our experiments, we use the pencil decomposition approach, which is commonly available in classical libraries and can be shown to be faster than the slab approach for large node counts [25]. In Fig. 4, we first show a strong scalability comparison between the heFFTe GPU and CPU implementations, the former being \({\sim }2{\times }\) faster than the latter; we observe very good linear scalability in both curves. Also, since the heFFTe CPU version is based on improved versions of kernels from the FFTMPI and SWFFT libraries [29], its performance is at least as good as theirs; therefore, heFFTe GPU is also \({\sim }2{\times }\) faster than FFTMPI and SWFFT. The drop in performance of the CPU implementation after 512 nodes (12,288 cores) is due to latency impact, which we verified with several experiments: for a \(1024^3\) size, the number of messages becomes very large while their size becomes very small. When the problem size increases, the CPU version keeps scaling very well, as shown in Fig. 5.
Next, Fig. 5 shows weak scalability comparison of heFFTe GPU and CPU implementations for different 3D FFT sizes, showing over \(2{\times }\) speedup and very good scaling.
To show the impact of local kernel acceleration, Fig. 6 presents a profile of a single 3D FFT using both the CPU and GPU versions of heFFTe, clearly displaying the over \(40{\times }\) speedup of local kernels and the large share of time spent in communication.
Next, in Fig. 7, we compare the strong and weak scalability of the heFFTe and FFTE libraries; heFFTe outperforms FFTE by a factor \({>}2\) and scales better. We do not include results with the AccFFT library, since its GPU version did not pass correctness verification in several experiments we performed on Summit; however, AccFFT reported a fairly constant speedup of \({\sim }1.5\) over FFTE, with very similar scalability [11].
4.1 Multi-node Communication Model
In Fig. 8, we numerically analyze how close we get to the roofline peak performance described in Sect. 3. By appropriately choosing the transform size and the number of nodes, we approach the proposed peak; hence, a correlation can be established between these two parameters to ensure that maximum resources are used, while still leaving GPU resources to simultaneously run other computations needed by applications.
4.2 Using heFFTe with Applications
Diverse applications targeting exascale make use of FFTs within their models. In this section, we consider LAMMPS [14], part of the EXAALT ECP project. Its KSPACE package provides a variety of long-range Coulombic solvers, as well as pair styles which compute the corresponding pairwise Coulombic interactions. This package relies heavily on efficient FFT computation to obtain the energy of a molecular system.
In Fig. 9 we present results from a LAMMPS benchmark, comparing the performance obtained with its built-in FFTMPI library against the GPU version of heFFTe. As shown in Fig. 4, even for large runs, using LAMMPS with heFFTe is expected to provide a \(2{\times }\) speedup of its KSPACE routine.
5 Conclusions
In this article, we presented the methodology, implementation and performance results of the heFFTe library, which performs FFT computations on heterogeneous systems targeting exascale. We provided experiments showing considerable speedups compared to state-of-the-art libraries, demonstrating that linear scalability is achievable. We greatly sped up local kernels, getting very close to the experimental roofline peak on Summit (a large heterogeneous cluster ranked first in the TOP500 list). Our results show that further optimizations would require better hardware interconnection and/or new communication-avoiding algorithmic approaches.
References
Ayala, A., et al.: Impacts of Multi-GPU MPI collective communications on large FFT computation. In: IEEE/ACM Workshop on Exascale MPI (ExaMPI) (2019)
Czechowski, K., McClanahan, C., Battaglino, C., Iyer, K., Yeung, P.K., Vuduc, R.: On the communication complexity of 3D FFTs and its implications for exascale, June 2012. https://doi.org/10.1145/2304576.2304604
Demmel, J.: Communication-avoiding algorithms for linear algebra and beyond. In: IEEE 27th International Symposium on Parallel and Distributed Processing (2013)
Emberson, J., Frontiere, N., Habib, S., Heitmann, K., Pope, A., Rangel, E.: Arrival of first summit nodes: HACC testing on phase I system. Technical report. MS ECP-ADSE01-40/ExaSky, Exascale Computing Project (ECP) (2018)
Filippone, S.: The IBM parallel engineering and scientific subroutine library. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds.) PARA 1995. LNCS, vol. 1041, pp. 199–206. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-60902-4_23
Franchetti, F., et al.: FFTX and SpectralPack: a first look. In: IEEE International Conference on High Performance Computing, Data, and Analytics (2018)
Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005). Special issue on “Program Generation, Optimization, and Platform Adaptation”
Gholami, A., Hill, J., Malhotra, D., Biros, G.: AccFFT: a library for distributed-memory FFT on CPU and GPU architectures. CoRR abs/1506.07933 (2015). http://arxiv.org/abs/1506.07933
Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison Wesley, Boston (2003)
Intel: Intel Math Kernel Library. http://software.intel.com/en-us/articles/intel-mkl/, https://software.intel.com/en-us/mkl/features/fft
Large-scale atomic/molecular massively parallel simulator (2018). https://lammps.sandia.gov/
Lee, M., Malaya, N., Moser, R.D.: Petascale direct numerical simulation of turbulent channel flow on up to 786k cores. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (2013)
Lin, S., et al.: FFT-based deep learning deployment in embedded systems. In: Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1045–1050 (2018)
Nukada, A., Sato, K., Matsuoka, S.: Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In: High Performance Computing, Networking, Storage and Analysis (2012)
NVIDIA: cuFFT library (2018). http://docs.nvidia.com/cuda/cufft
Pekurovsky, D.: P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM J. Sci. Comput. 34(4), C192–C209 (2012). https://doi.org/10.1137/11082748X
Plimpton, S., Kohlmeyer, A., Coffman, P., Blood, P.: fftMPI, a library for performing 2d and 3d FFTs in parallel. Technical report, Sandia National Lab. (SNL-NM), Albuquerque, NM, United States (2018)
Plimpton, S.J.: FFTs for (mostly) particle codes within the DOE exascale computing project (2017)
Popovici, T., Low, T.M., Franchetti, F.: Large bandwidth-efficient FFTs on multicore and multi-socket systems. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE (2018)
Richards, D., Aziz, O., Cook, J., Finkel, H., et al.: Quantitative performance assessment of proxy apps and parents. Technical report, Lawrence Livermore National Lab. (LLNL), Livermore, CA, United States (2018)
Shaiek, H., Tomov, S., Ayala, A., Haidar, A., Dongarra, J.: GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems. Extended Abstract icl-ut-19-06, September 2019
Takahashi, D.: Implementation of parallel 3-D real FFT with 2-D decomposition on Intel Xeon Phi clusters. In: 13th International Conference on Parallel Processing and Applied Mathematics (2019)
Takahashi, D.: FFTE: a fast Fourier transform package (2005). http://www.ffte.jp/
Tomov, S., Haidar, A., Ayala, A., Schultz, D., Dongarra, J.: Design and implementation for FFT-ECP on distributed accelerated systems. ECP WBS 2.3.3.09 Milestone report FFT-ECP ST-MS-10-1410, Innovative Computing Laboratory, University of Tennessee, revision, April 2019
Tomov, S., Haidar, A., Ayala, A., Shaiek, H., Dongarra, J.: FFT-ECP Implementation optimizations and features phase. Technical report, ICL-UT-19-12, October 2019
Tomov, S., Haidar, A., Schultz, D., Dongarra, J.: Evaluation and design of FFT for distributed accelerated systems. ECP WBS 2.3.3.09 Milestone report FFT-ECP ST-MS-10-1216, Innovative Computing Laboratory, University of Tennessee, revision, October 2018
© 2020 Springer Nature Switzerland AG
Ayala, A., Tomov, S., Haidar, A., Dongarra, J. (2020). heFFTe: Highly Efficient FFT for Exascale. In: Krzhizhanovskaya, V., et al. Computational Science – ICCS 2020. ICCS 2020. Lecture Notes in Computer Science(), vol 12137. Springer, Cham. https://doi.org/10.1007/978-3-030-50371-0_19