IBM®
Skip to main content
    Country/region [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    

IBM Journal of Research and Development

Blue Gene   Volume 49, Number 2/3, 2005
Table of contents: HTMLPDF This article: HTML PDFDOI: 10.1147/rd.492.0437Copyright info

Vectorization techniques for the Blue Gene/L double FPU

by J. Lorenz,
S. Kral,
F. Franchetti,
and C. W. Ueberhuber

This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene®/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimization methods as they are incorporated in the Vienna MAP vectorizer and optimizer. The innovative technologies presented here, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and joined with the automatic performance-tuning system known as Fastest Fourier Transform in the West (FFTW). FFTW performance-optimization facilities working with the compiler technologies presented in this paper are able to produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.

Introduction

The IBM Blue Gene*/L (BG/L) supercomputer [1], planned to be in operation in 2005, will be an order of magnitude faster than the Earth Simulator. BG/L will feature eight times more processors than current massively parallel systems. To tame this vast parallelism, new approaches and tools have had to be developed. However, developing highly efficient numerical software has to start with optimizing the computational kernels for the nonstandard floating-point unit (FPU) of the BG/L processors. This so-called double FPU provides support for complex arithmetic as an important prerequisite to speed up large scientific codes.

The utilization of nonstandard FPUs in computational kernels, like fast Fourier transforms (FFTs), is by no means straightforward. Optimization of FFT kernels leads to complicated data dependencies of real variables that cannot easily be mapped to the elaborate BG/L FPU. This problem is particularly demanding in the context of automatic performance tuning, but it must be addressed in order to obtain high-performance FFT implementations, which are required as major building blocks for the large scientific codes planned to be run on BG/L. Most of these applications require very fast one-dimensional FFT routines to be run on a single processor for computing relatively small transforms.

This paper introduces a new FFT library, BGL/FFTW–GEL, that runs efficiently on the BG/L prototypes. This library is the first numerical library for BG/L not developed by IBM. It takes full advantage of the double FPU by means of short-vector single-instruction multiple-data (SIMD) vectorization.

BGL/FFTW–GEL is the result of a combination of FFTW with special-purpose vectorization technology in the Vienna MAP vectorizer [24]. FFT codes produced by BGL/FFTW–GEL are running five times faster than standard nonadaptive FFT libraries [2]. On the DD2 prototype, speeds up to 1.8 times greater than the best FFT code not utilizing the special features of the BG/L double FPU were achieved.

The Blue Gene/L supercomputer

The initial DD1 prototype of the IBM Blue Gene/L supercomputer [1], equipped with 8,192 custom-made IBM PowerPC* 440 FP2 processors running at 500 MHz, achieved a Linpack performance of Rmax = 11.7 teraflops, i.e., 71% of its theoretical peak performance of Rpeak = 16.4 teraflops. This performance ranks the BG/L prototype machine at position four of the Top500 list (June 2004) [5]. The prototype machine is roughly 1/20th the physical size of machines of comparable compute power—such as Linux** clusters—that exist today.

The 64K-processor BG/L machine currently being built for the Lawrence Livermore National Laboratory (LLNL) will be eight times larger, occupying 64 full racks. When completed in 2005, the LLNL supercomputer—featuring 360 teraflops peak performance—is expected to lead the Top500 list. Compared with the fastest supercomputers of today, it will be an order of magnitude faster, consume 1/15th of the power, and be ten times more compact.

Complex and real arithmetic

Since there are many areas of scientific computing, such as computational electronics, in which complex arithmetic plays an important role, its native support has been integrated into the FPUs of computers devoted to such applications. Nevertheless, even algorithms using complex arithmetic may have to be reformulated in terms of real arithmetic to allow for the application of the inevitable optimization techniques to achieve satisfactory performance of scientific codes: common subexpression elimination, constant folding, and copy propagation on the real and imaginary parts.

BG/L double FPU

The BG/L PowerPC 440 (PPC440) double floating-point FPU (FP2) was obtained by replicating the standard PPC440 FPU and adding crossover datapaths and sign-change capabilities to allow the short-vector SIMD fused multiply–add (FMA) operations to support complex multiplication. Up to four real floating-point operations (one SIMD FMA) can be issued every cycle, and efficient intermixing of scalar and vector operations is possible. The data to be processed has to be naturally aligned on 16-byte boundaries in memory.

The PPC440 FP2 exhibits some problematic characteristics: a single data reorder operation within a short-vector SIMD register is as expensive as one arithmetic two-way FMA operation, and alternatively, either a floating-point operation or a data reorganization instruction can be issued every cycle. Conventional vectorization techniques are not able to deal efficiently with these architectural shortcomings. Thus, without a tailor-made adaptation of established short-vector SIMD vectorization techniques to the specific features of the BG/L double FPU, no high-performance short-vector code can be obtained.

Blue Gene/L ISA extension

The Blue Gene/L new single-instruction, multiple-operation, multiple-data (SIMOMD) instruction set architecture (ISA) extension includes all well-known short-vector SIMD-style (interoperand, parallel) instructions, such as the ones supported by the Intel** Streaming SIMD Extensions 2 (SSE2) or AMD 3DNow!**. This ISA extension allows the use of the double FPU either as a complex FPU or as a real two-way vector FPU.

When it is used as a complex FPU, programs using complex arithmetic can be mapped to the Blue Gene/L double FPU in a straightforward manner. The alternative, using it as a real two-way vector FPU where real code is presupposed, is applicable only if the underlying algorithm allows for enough parallelism to be extracted. Moreover, the mapping is significantly more complicated in this case.

In the context of automatic performance-tuning software, the GNU C compiler port for Blue Gene/L is not suitable because it supports only 32 temporary variables when accessing the double FPU. Thus, only the IBM XL C compiler is a reasonable choice.

To employ the Blue Gene/L double FPU in automatic performance-tuning software, three approaches are possible:

  1. Implement the numerical kernels in C using proprietary directives such that the XL C compiler vectorization possibilities prove successful.
  2. Rewrite the numerical kernels in assembly language using specific double FPU instructions.
  3. Rewrite the numerical kernels utilizing the XL C language extension to C99 that provides access to the double FPU on the source level by means of special data types and intrinsic functions.

This paper describes how vector code can be generated automatically by following the third approach. Thus, register allocation and instruction scheduling is left to the compiler while vectorization and instruction selection is done at the source-code level by the newly developed approach presented in the following sections.

Self-adapting FFT software

The FFT algorithm is among the most important computational innovations of the twentieth century and continues to be a focus of current research. In scientific computing, FFT algorithms are—in addition to linear algebra algorithms—core algorithms of almost any computationally intensive numerical software. Accordingly, the application of FFT transforms ranges from small-scale problems with stringent time constraints (for instance, in real-time signal processing) up to large-scale simulations and partial differential equation (PDE) solvers running on the largest supercomputers in the world. Thus, high-performance software tailor-made for such applications is desperately needed.

FFT algorithms are structurally complex and difficult to efficiently map onto standard hardware. Until recently, FFT packages required large, finely tuned, machine-specific libraries produced by highly skilled software developers. Therefore, these packages failed to perform well for a variety of architectures.

In 1997, the state of the art in scientific software production changed dramatically when automatic software generators producing highly optimized code entered the field. New standards were set by FFTW, a free collection of fast C routines for computing the discrete Fourier transform (DFT) in one or more dimensions [67]. FFTW was designed for producing automatically tuned FFT libraries and automatically tuned linear algebra software (ATLAS) [8] that generates highly efficient basic linear algebra subroutines (BLAS).

Typically, FFTW produces code that runs faster than publicly available FFT codes and compares well to vendor libraries. A dynamic programming approach relying on a recursive implementation of the Cooley/Tukey FFT algorithm [9] enables the adaptation of the FFT computation of a given size to a given target machine at runtime. The actual computation is done within routines called twiddle and no-twiddle codelets, which are produced by the code generator GENFFT [10], whose output consists of basic blocks of thousands of lines of code that can be transformed into static single assignment (SSA) form.

Code generated by automatic performance-tuning software such as FFTW and ATLAS is supposed to be translated by standard compilers to enable portability. However, automatically generated numerical code translated by standard compilers is often not able to achieve satisfactory performance. To accomplish top performance in such cases, the exploitation of special processor features, such as short-vector SIMD or FMA ISA extensions, is imperative.

Unfortunately, the methods used by conventional vectorizing compilers to deal with loops or basic blocks lead to inefficient results when applied to automatically generated numerical codes. These inefficiencies are due, among other things, to the inability of such compilers to utilize domain-specific information revealing the parallelism inherent in the codes. For instance, conventional vectorization techniques entail unacceptably large overhead by applying data-reordering operations that are, in principle, nonessential.

Related work

The main topic of this paper is the automatic vectorization of basic blocks of automatically generated code. Besides this kind of vectorization, there are also other ways to automatically vectorize FFT code.

Formal FFT vectorization

The formal vectorization approach [1116] developed for classical short-vector SIMD extensions, such as the Intel SSE family, the AMD 3DNow! family, and the Motorola AltiVec**, has been ported successfully to Blue Gene/L FPUs [17]. This type of vectorization is based on the SIMD vectorizing version of the synchronous programming language (SPL) compiler [14] that enables the SPIRAL system [1819] to automatically optimize code targeted at the double FPUs of Blue Gene/L.

Vectorization in FFTW 3

Version 3.0.1 of FFTW supports the SIMD extensions SSE, SSE2, 3DNow!, and AltiVec. A new algorithm is used to compute complex DFTs by means of two-way parallel computation of real DFTs. Porting FFTW 3.0.1 to Blue Gene/L requires the mapping of the SIMD instructions required by FFTW to instructions that exist on Blue Gene/L. Preliminary experiments carried out with no-twiddle codelets show promising results.

Other vectorization techniques

Some methods for vectorizing basic blocks [2022] attempt to find an efficient mix of SIMD and scalar instructions to carry out the required computation, whereas the vectorization techniques introduced in the next section aim at a full utilization of the power of SIMD instructions while trying to keep the SIMD reordering overhead reasonably small.

The vectorization method of [21] introduces more SIMD data-reordering instructions than necessary, because it is unable to use a representation of the numerical scalar directed acyclic graph (DAG) as vectorization input, and is thus deprived of this parallelism-revealing instrument. This approach is not a suitable choice for the efficient handling of typical numerical codes, such as FFTs, since explicit SIMD data-reordering operations are very expensive on the Blue Gene/L FPU.

Vienna MAP vectorizer

The Vienna MAP vectorizer [2423] automatically extracts two-way short-vector SIMD parallelism from a scalar code block by adequately combining scalar variables into SIMD variables and by joining the corresponding scalar instructions to one or more short-vector SIMD instructions, as illustrated in Figure 1. The MAP vectorizer targets automatically generated code that consists solely of arithmetic operations and read/write array access operations involving index computation.

Figure 1 Figure 1

Ideally, two-way vectorization transforms any pair of scalar instructions to one SIMD instruction yielding 100% SIMD utilization (Figure 1). This maximum is achievable only for completely parallel scalar DAGs. For DAGs with less parallelism, SIMD reordering instructions are required, at the cost of reduced SIMD utilization.

Since not all combinations of scalar operations may be joined into one SIMD instruction (as defined by the ISA extension of the target processor), a realistic goal for the vectorizer is to completely cover the given scalar DAG by natively supported SIMD instructions while achieving a satisfactory runtime performance, which is tantamount to minimizing SIMD data reorganization. Figure 2 gives an example of short-vector SIMD code obtained by vectorizing straight-line complex FFT code.

Figure 2 Figure 2

The Vienna MAP vectorizer was adapted to support Blue Gene/L FPUs. As a supplement to the MAP vectorizer, a peephole optimizer enables the extraction of fused multiply–add SIMD instructions.

Fundamentals of vectorization

Fusion of variables
Two scalar variables A and B can be fused either into a SIMD variable of the form AB = (A, B) or vice versa, BA = (B, A), where ABBA. Moreover, no scalar variable can be part of two different SIMD variables.

An already existing fusion AB = (A, B) is said to be compatible with another fusion CD = (C, D) requested in the vectorization process if and only if AB = CD or A = D and B = C. In the first case, fusion CD does not have to be generated, since AB can be used. In the second case, a SIMD swap operation is required to maintain data-flow consistency when using fusion AB instead of generating and using CD (Figure 3).

Figure 3 Figure 3

Combination of operations
Joining rules specify the allowed ways of pairwise transforming scalar instructions into one or more SIMD instructions. The MAP vectorizer supports transformations of the following instruction combinations: 1) load/load, 2) store/store, 3) unary/unary, 4) binary/binary, 5) unary/binary, 6) unary/load, and 7) load/binary.

Joining rules 1 and 2 supports the transformation of memory operations accessing consecutive and nonconsecutive memory locations. Rules 3 and 4 allow the pairing of instructions of only the same type, while rules 5 to 7 also allow mixed-type pairings.

Rules of type 3 fuse the two source operands S1 and S2 for transforming two unary instructions (uop1, S1, D1) and (uop2, S2, D2).

Rules of type 4 provide several alternatives (Figure 4). Because they target two binary instructions (bop1, S1, T1, D1) and (bop2, S2, T2, D2), different possibilities arise for choosing the fusion partners among the four source operands S1, T1, S2, and T2. Thus, three layouts—ACC, PAR, and CHI, which define the possibilities for fusing the operand variables for binary instructions as shown in Figure 5—are introduced. ACC is needed for fusing variables used as operands for SIMD instructions of intra-operand style, whereas PAR and CHI are meant for those of parallel style, as illustrated in Figure 4.

Figure 4 Figure 4 Figure 5 Figure 5

Vectorization quality
To extract high-performance short-vector SIMD code distinguished by good SIMD utilization, the joining rules issue SIMD reorder instructions only in the unavoidable case of a compatible fusion demanding a swap instruction. The majority of the extracted swaps can be removed by next applying a peephole optimization after the vectorization process.

The vectorization engine begins by constraining all SIMD memory operations to access consecutive locations and by disabling the suboptimal pairing rules 5–7. If these restrictions cause the vectorization process to fail, it is restarted after enabling operation pairing rules 5–7 and support for less efficient (that is, nonconsecutive) memory access operations. Abandonment of restrictions substantially augments the class of vectorizable codes by allowing the extraction of some less efficient instruction combinations.

MAP vectorization algorithm

Before the actual vectorization process is started, the following preparatory steps are taken. First, a dependency analysis is performed on the scalar DAG. Then, instruction statistics are assembled which provide instruction counts for each instruction type and operation. Data gathered in these first two steps is used as a heuristic to speed up the vectorization process by avoiding futile vectorization attempts. Finally, store instructions are combined nondeterministically by fusing their respective source operands.

Vectorization algorithm
The MAP vectorization algorithm consists of two steps:

  1. Pick I1 = (op1, s1, t1, d1) and I2 = (op2, s2, t2, d2), i.e., two scalar instructions that have not yet been vectorized, with (d1, d2) or (d2, d1) being an existing fusion.
  2. The two scalar operations op1 and op2 are paired nondeterministically, yielding an equivalent sequence of SIMD operations. This step may impose the need for new fusions if no compatible fusions are available. In this case, the layout for the fusion of the respective source operands s1, t1, s2, and t2 is mandated by the pairing rule. The vectorization process must ensure that no scalar variable is part of two different fusions.

The vectorizer alternately applies step 1 and step 2 until either the vectorization succeeds (i.e., thereafter all scalar variables are part of at most one fusion, and all scalar operations have been paired) or the vectorization fails. If the vectorizer succeeds, it immediately commits to the first solution of the search process, which keeps the vectorization runtime reasonably small. Although a search for the solution that achieves the shortest runtime would be desirable, it is not feasible using the current version of the vectorizer, even for relatively small straight-line codes.

Nondeterminism in vectorization
Nondeterminism in vectorization arises due to vectorization alternatives such as ACC, PAR, and CHI, for binary/binary pairings. For a fusion (d1, d2), there may be several layouts for fusing the source operands s1, t1, s2, and t2, depending on the pairing (op1, op2), as illustrated in Figure 4. This kind of nondeterminism widens the search space of the vectorizer backtracking search engine.

The rule ranking, i.e., the order in which vectorization alternatives are tried, may influence the order of the solutions of the vectorization process. Because the vectorizer always commits to the first solution, the rule ranking is adapted such that the first solution favors instruction sequences that are particularly well-suited for the given target machine, taking into account the different costs of individual instructions (Table 1).


Table 1 Relative costs of SIMD operations. For a selection of ISA extensions, the actual number of SIMD instructions required to implement the respective SIMD operations is given. This data directly influences the rule ranking underlying the MAP vectorizer.
SIMD opISA

Basic 3DNow! (K6-II+)Extended 3DNow! (K7/K8)SSE2 (P4/K8)SSE3 (P4e)IA64 (Intel Itanium**)Double FPU (440 FP2)

Load/store111111
 
Uniform unpack111112
Mixed unpack222212
 
Uniform ACC113135
Mixed ACC214235
 
Uniform PAR111111
Mixed PAR222111
 
PAR FMA222211

The rule ranking has to be considered as a kind of extraction “hint.” At every point of decision, the search engine initially tries the rule that is ranked first. If this does not succeed, i.e., does not lead to a vectorization, later-ranked rules are also used, even if their application effectuates the extraction of pseudo instructions that are not supported on the target ISA. This kind of retreat is unavoidable, as a complete vectorization is the central goal.

Realization of the vectorization engine

The MAP vectorization algorithm is implemented using a depth-first search engine with chronological backtracking. This backtracking capability is indispensable when a fusion that is requested by the current vectorization alternative does not comply with the globally existing fusions. In these cases, the search engine backtracks to the last nondeterministic point of decision. There, another vectorization alternative is chosen, and corresponding fusions of different layouts are requested and generated if necessary. If these fusions comply with the set of existing fusions, the rule fires and the vectorization process continues. Otherwise, backtracking is chronologically applied repeatedly until either a vectorization is obtained or the search space is exhausted. In the latter case, the vectorization engine is unable to find a valid fusion set for the given scalar DAG.

Vienna MAP optimizer

The Vienna MAP optimizer is a rule-based local rewriting system that implements peephole optimization on vector DAGs. It postprocesses the output of the MAP vectorizer and comprises two groups of rewriting rules. Finally, the optimized output is sorted topologically in an attempt to minimize the lifespan of variables by improving the locality of variable accesses, using a scheduling algorithm based on the FFTW scheduler GENFFT [10].

General set of rules

The first group of rewriting rules consists of general optimization rules (Figure 6) aiming at minimization of the instruction count, elimination of redundancy and dead code, reduction of the number of source operands (which reduces register pressure), minimization of the critical path length of the vector DAG, copy propagation, and constant folding.

Figure 6 Figure 6

With target architectures that support FMAs, such as the Blue Gene/L double FPU, the FMAs are extracted by combining multiplications (or sign changes) with directly dependent addition operations (or subtraction operations or already existing FMAs) into FMAs. If this direct combination is not possible at first, the respective instructions are moved down in the DAG in an attempt to fold them into other instructions.

Specific set of rules

The second group of rewriting rules is specific to target architecture. When optimizing for the Blue Gene/L PPC440 FP2, vector swap instructions are folded into FMAs utilizing vector FMA instructions with crossed datapaths, exclusively available on Blue Gene/L, using a method similar to FMA extraction (Figure 7).

Figure 7 Figure 7

Experimental results

Numerical experiments based on one-dimensional FFTs applied to vectors with power-of-2 lengths N = 23, 24, ···, 218 = 262,144 and non-power-of-2 lengths N = 9, 12, ···, 27,000 were carried out using FFTW combined with the new vectorization and optimization techniques as they are implemented in the Vienna MAP vectorizer and optimizer.

In these experiments the new vectorization and optimization techniques were built into BGL/FFTW–GEL and evaluated on the Blue Gene/L DD2 prototype using one single PowerPC 440 FP2 processor running at 700 MHz. The best-performing scalar code generated by FFTW as well as FFTW code vectorized by the XL C compiler is used as a performance benchmark.

The floating-point performance displayed in Figures 8(a) and 8(b) is given in pseudo Gflops, i.e., 5N log N/T, with N being the vector length and T the measured runtime in nanoseconds.

Figure 8 Figure 8

Figures 8(a) and 8(b) compare different FFTW library implementations:

  • The best vectorized code obtained using all technologies presented in this paper (BGL/FFTW–GEL).
  • The best scalar FFTW implementation (with the XL C vectorizer and FMA extraction facility turned off).
  • The best vectorized FFTW implementation obtained with the activated vectorizer and FMA extraction facility of the XL C compiler.

The impressive performance gain attainable with BGL/FFTW–GEL over scalar code generated by FFTW is shown in Figure 9. For power-of-2 FFTs, BGL/FFTW–GEL yields speedup figures up to 80% with respect to the best-performing scalar code, i.e., the scalar FFTW library (which serves as a baseline in the diagram). For large problem sizes, BGL/FFTW–GEL still yields a speedup of 60%. Thus, the vectorization and optimization methods of this paper have been demonstrated to effectuate significant performance improvements. XL C, with its vectorization and FMA extraction facility turned on, when applied to FFTW-generated code without taking advantage of the techniques presented in this paper, produces vectorized code that runs at the same speed or slightly slower than scalar XL C code.

Figure 9 Figure 9

Conclusions and outlook

FFTs are indispensable parts of nearly every kind of scientific computing application. Thus, efficient FFT software is needed by Blue Gene/L scientific users. The performance portable vectorization techniques introduced in this paper allow timely software optimization to be done concurrently with IBM hardware development on Blue Gene/L.

The highly portable Vienna MAP vectorizer can be used to automatically vectorize numerical straight-line code generated by state-of-the-art automatic performance-tuning software, such as FFTW, thereby helping to develop highly efficient implementations of FFT kernels.

Performance experiments carried out on Blue Gene/L prototypes show that the newly developed vectorization approach in combination with state-of-the-art performance-tuning software is able to speed up numerical codes considerably. The vectorization approach of this paper has been demonstrated to produce high-performance FFT kernels for the Blue Gene/L supercomputers that fully utilize the new double FPU.

An integral part of our current work is a compiler back end that will be particularly well-suited for the compilation of numerical straight-line code.

Acknowledgments

We thank Matteo Frigo and Steven G. Johnson for years of pleasant and productive cooperation. Special thanks go to Manish Gupta, José Moreira, and their group at the IBM Thomas J. Watson Research Center for making it possible to work on the Blue Gene/L prototype and for a very pleasant and fruitful cooperation. The Center for Applied Scientific Computing at Lawrence Livermore National Laboratory deserves particular appreciation for ongoing support. Additionally, we would like to acknowledge the financial support of the Austrian Science Fund FWF. The work described in this paper was supported by the Special Research Program SFB F011 “Aurora” of the Austrian Science Fund FWF. Franz Franchetti was supported by the Austrian Science Fund FWF's Erwin Schroedinger Fellowship J2322.

*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Linus Torvalds, Intel Corporation, Advanced Micro Devices, Inc., or Motorola, Inc. in the United States, other countries, or both.

References

Received June 17, 2004; accepted for publication August 23, 2004; Published online March 31, 2005.


    About IBMPrivacyContact