Performance of lattice QCD programs on CP-PACS
Introduction
Quarks and gluons are the building blocks of a large number of elementary particles, collectively called hadrons, which include well-known particles such as the proton and the neutron. A remarkable property of quarks and gluons is confinement: while there is solid evidence that they exist within hadrons, they have never been observed in isolation in experiments. The dynamics of quarks and gluons is described by a gauge field theory called quantum chromodynamics (QCD).
QCD is a highly non-linear quantum-mechanical system in which the basic quark and gluon field degrees of freedom are defined at each point of four-dimensional space-time. While field fluctuations of short wavelength are weakly coupled, the coupling becomes stronger at longer wavelengths. These features render an analytical solution of QCD a formidably arduous task. Instead, progress over the past two decades has come from numerical simulations using a formulation of QCD on a four-dimensional space-time lattice, known as lattice QCD (see e.g. [1]; for introductions to lattice gauge theories, see [2], [3]).
Approximating continuous space-time with a sufficiently fine lattice necessarily requires a large lattice size L, with the consequence that the number of degrees of freedom increases as L⁴. When we increase L we usually reduce the light quark mass toward its physical value; this requires additional computation due to critical slowing down. Taking these two factors into account, the amount of computing actually needed is expected to increase at least as fast as L⁸–L¹⁰. The numerical simulation of lattice QCD therefore requires significant computing power. On the other hand, the quark and gluon fields of QCD interact only locally in space-time, so lattice QCD simulations are ideally suited to parallelization over the space-time coordinates.
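As a rough illustration (a back-of-the-envelope sketch, not a formula from the article), the scaling argument above can be turned into a relative cost estimate; the reference lattice size and the exponent below are assumptions chosen for the example:

```python
# Back-of-the-envelope cost model for the scaling argument above:
# degrees of freedom grow as L**4, and with critical slowing down the
# total work is estimated to grow at least as fast as L**8 to L**10.
def relative_cost(L, L_ref, exponent=8):
    """Work for lattice size L relative to a reference size L_ref."""
    return (L / L_ref) ** exponent

# Doubling the lattice size with the conservative exponent 8
# already multiplies the required computing by 2**8 = 256.
factor = relative_cost(64, 32, exponent=8)
```

At the upper end of the quoted range (exponent 10), the same doubling costs a factor of 1024, which is why dedicated machines are needed.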
Exploiting this feature, a number of dedicated parallel computers have been developed since the 1980s to advance lattice QCD simulations (for reviews, see e.g. [4], [5]). The CP-PACS parallel computer is one of the latest efforts in this direction. It is worth emphasizing, however, that the parallelism inherent in lattice QCD is shared by a large number of physics problems in which fields over space-time or space are the basic dynamical variables. Thus, the overall objective of the CP-PACS Project is broader, encompassing astrophysics and condensed matter applications in computational physics. This is reflected in the name of the computer, which is an acronym for Computational Physics by Parallel Array Computer System. The CP-PACS has been developed in collaboration with Hitachi.
The CP-PACS started to operate for physics computations in April 1996 with 1024 processing nodes. The upgrade to the final 2048 processor system with a peak speed of 614 GFLOPS was completed in late September 1996. So far most of the CPU time has been devoted to simulations of lattice QCD. In this article we report the performance of CP-PACS for this problem based on the measurements recorded in the actual production runs.
We first performed a large scale simulation of QCD in the “quenched” approximation, where the effects of quark–antiquark pair creation/annihilation are neglected in the intermediate processes. Quenched QCD calculations require a large memory size and were performed using the entire system of 2048 nodes. Physics results of the quenched simulation have been presented elsewhere [12], [13], [14]. We then started a systematic study of “full QCD”, progressively eliminating the quenched approximation. Full QCD simulations demand much more computer time than quenched simulations. Preliminary physics results of our full QCD simulations have been presented in [15], [16], [17]. For a short summary of physics results from the CP-PACS, see [6], [7], [8], [9].
Summarizing the results for the performance of the entire CP-PACS system, our optimized code has achieved a sustained speed of 237.5 GFLOPS for the heat-bath update of gluon variables, 264.6 GFLOPS for the over-relaxation update, and 325.3 GFLOPS for quark matrix inversion with the even–odd preconditioned minimal residual algorithm.
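The sustained speeds quoted above can be put in perspective against the 614 GFLOPS peak of the 2048-node system; the following sketch simply computes the implied sustained-to-peak ratios from the figures in the text:

```python
# Sustained-to-peak ratios implied by the figures quoted above
# (peak speed of the full 2048-node system: 614 GFLOPS).
peak_gflops = 614.0
sustained = {
    "heat-bath update": 237.5,
    "over-relaxation update": 264.6,
    "MR quark matrix inversion": 325.3,
}
ratios = {name: round(g / peak_gflops, 2) for name, g in sustained.items()}
# The solver kernel reaches roughly half of the machine's peak speed.
```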
CP-PACS computer
The CP-PACS is a MIMD parallel computer with distributed memory, consisting of 2048 processing nodes (PU) and 128 I/O nodes (IOU). The nodes are interconnected into an 8×17×16 array by a three-dimensional hyper-crossbar network of crossbar switches, as shown schematically in Fig. 1. Each PU has a newly developed RISC processor, the HARP1-E, with a peak speed of 300 MFLOPS for 64-bit data and 64–256 MByte of main memory. For intermediate storage, a RAID-5 disk of 8.3 GByte is attached to each IOU. Thus, …
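A quick consistency check (simple arithmetic, not figures from the article's tables) shows how the 8×17×16 array accommodates both node types and where the quoted peak speed comes from:

```python
# The 8 x 17 x 16 hyper-crossbar array holds both node types.
total_nodes = 8 * 17 * 16          # 2176 slots in the array
pu, iou = 2048, 128                # processing nodes and I/O nodes
assert total_nodes == pu + iou

# Peak speed: 2048 PUs at 300 MFLOPS each, quoted as 614 GFLOPS.
peak_gflops = pu * 300 / 1000.0    # 614.4
```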
Coding lattice QCD on CP-PACS
The basic dynamical variables of QCD are the gluon field and the quark field. On a four-dimensional lattice of size Nx×Ny×Nz×Nt, the gluon field is represented by a set of complex 3×3 matrices U(n,μ), where n=(nx,ny,nz,nt) denotes lattice sites with 1⩽nx,y,z,t⩽Nx,y,z,t, and μ=x,y,z,t the four directions. The quark field in Wilson's formalism is represented by a 12-component complex vector Q(n). The objective of lattice QCD simulations is to numerically evaluate by a Monte Carlo method the …
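The field content described above can be sketched as arrays; the layout below is a single-node, toy-sized illustration (a production code distributes these arrays over the parallel machine, and the shapes and names here are chosen for the example):

```python
import numpy as np

# Illustrative (single-node, toy-sized) layout of the lattice fields;
# the article's quenched run used a 64^3 x 112 lattice instead.
Nx = Ny = Nz = 4
Nt = 8

# Gluon field U(n, mu): one complex 3x3 matrix per site and direction.
U = np.zeros((Nt, Nz, Ny, Nx, 4, 3, 3), dtype=np.complex128)

# Wilson quark field Q(n): 12 complex components (4 spin x 3 color).
Q = np.zeros((Nt, Nz, Ny, Nx, 4, 3), dtype=np.complex128)

# Per-site storage: (4*9 + 12) complex doubles = 48 * 16 = 768 bytes.
bytes_per_site = (U.nbytes + Q.nbytes) // (Nx * Ny * Nz * Nt)
```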
Performance of quenched simulation
We now describe the performance of our QCD programs on the CP-PACS computer, obtained in our production run for a quenched QCD hadron spectrum calculation on the full 2048 processing units. The lattice size is 64³×112, the largest lattice employed so far in lattice QCD simulations. In Table 2 we list a breakdown of the execution time of one cycle of the simulation into the four basic steps. During the run we measure the execution times of important subroutines using a …
Full QCD simulations
In full QCD simulations, the most time-consuming part, both in the update and in the solver, is mult. The performance of mult has already been discussed in Section 4.2. Here, however, several additional remarks are in order.
First, full QCD simulations consume far more computer time than quenched simulations. Simple scaling estimates indicate a 100-fold or larger amount of computation for full QCD compared to quenched QCD with current algorithms. Therefore, the use of a …
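To indicate the kind of operation mult performs, here is a toy, one-direction version of its inner kernel: multiplying the 3×3 gluon link matrix into the colour index of the neighbouring quark field. Spin projection, the remaining directions, and the diagonal term of the Wilson operator are all omitted, and the array names are invented for the example:

```python
import numpy as np

# Toy, one-direction kernel of a Wilson-type "mult": apply the 3x3
# gluon link matrix to the colour index of the +x neighbour.
rng = np.random.default_rng(0)
L = 4                                   # toy one-dimensional lattice extent
U_x = rng.standard_normal((L, 3, 3)) + 1j * rng.standard_normal((L, 3, 3))
Q = rng.standard_normal((L, 4, 3)) + 1j * rng.standard_normal((L, 4, 3))

# hop[n, s, a] = sum_b U_x[n, a, b] * Q[n+1, s, b], periodic boundary.
Q_shift = np.roll(Q, -1, axis=0)
hop = np.einsum('nab,nsb->nsa', U_x, Q_shift)
```

Each lattice site thus performs small dense complex matrix–vector products, which is why the kernel parallelizes and vectorizes well on the CP-PACS.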
Summary
We have presented the performance data of the CP-PACS computer measured during recent production runs for quantum chromodynamics simulations of quarks and gluons. In a run with the quenched approximation, we used the full 2048 processing nodes of the CP-PACS and obtained a sustained speed of 237.5 GFLOPS for the heat-bath update of gluon variables, 264.6 GFLOPS for the over-relaxation update, and 325.3 GFLOPS for quark matrix inversion with an even–odd preconditioned minimal residual algorithm.
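As a sketch of the solver named above, the plain minimal residual (MR) iteration can be written in a few lines. The even–odd preconditioning applied in the production code is omitted here, and this toy version assumes a matrix whose Hermitian part is positive definite:

```python
import numpy as np

def minimal_residual(A, b, tol=1e-10, max_iter=1000):
    """Toy minimal residual (MR) solver for A x = b.

    The production code inverts the even-odd preconditioned Wilson
    quark matrix; that preconditioning is omitted in this sketch.
    """
    x = np.zeros_like(b)
    r = b.copy()
    for _ in range(max_iter):
        Ar = A @ r
        # Step length minimizing || r - alpha * A r || over alpha.
        alpha = np.vdot(Ar, r) / np.vdot(Ar, Ar)
        x = x + alpha * r
        r = r - alpha * Ar
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x
```

Each iteration costs one application of A (one mult) plus a few vector operations, which is why the performance of the whole solver is dominated by mult.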
Acknowledgements
We are grateful to the members of the CP-PACS project for discussions and encouragement. We thank Y. Iwasaki, A. Ukawa and H.P. Shanahan for reading the manuscript. This work is supported by Grants-in-Aid of the Ministry of Education, Science and Culture (Nos. 08NP0101 and 10640248).
References (37)
- et al., A potpourri of results in QCD from large lattice simulations on the CM5, Nucl. Phys. B (Proc. Suppl.) (1994)
- et al., Performance of the Cray-T3D and emerging architectures on canopy QCD applications, Nucl. Phys. B (Proc. Suppl.) (1996)
- et al., Improved continuum limit lattice action for QCD with Wilson fermions, Nucl. Phys. (1985)
- Confinement of quarks, Phys. Rev. D (1974)
- M. Creutz, Quarks, Gluons, Lattices, Cambridge University Press, Cambridge, ...
- I. Montvay, G. Münster, Field Theory on the Lattice, Cambridge University Press, Cambridge, ...
- Y. Iwasaki, Computers for lattice field theories, Nucl. Phys. B (Proc. Suppl.) 34 (1994) ...
- J.C. Sexton, Computers for lattice simulation, ibid. 47 (1996) ...
- Y. Iwasaki, The CP-PACS project, Nucl. Phys. B (Proc. Suppl.) 60A (1998) ...