High-performance direct gravitational N-body simulations on graphics processing units
Introduction
Since the first large-scale simulations of self-gravitating systems, the direct N-body method has gained a solid footing in the research community. At the moment N-body techniques are used in astronomical studies of planetary systems, debris discs, stellar clusters and galaxies, all the way to simulations of the entire universe (Hut, 2007). Outside astronomy, the main areas of research which utilize the same techniques are molecular dynamics, elementary particle scattering simulations, plate tectonics, traffic simulations and chemical reaction network studies. In these non-astronomical applications the main force-evaluation routine is not as computationally demanding as in gravitational N-body simulations, but the backbone simulation environments are not very different.
The main difficulty in simulating self-gravitating systems is the lack of antigravity, which results in the requirement of global communication: each object feels the gravitational attraction of every other object.
The first astronomical simulation of a self-gravitating N-body system was carried out by Holmberg (1941), who used 37 light bulbs and photoelectric cells to evaluate the forces on the individual objects. Holmberg spent weeks performing this quite moderate 37-particle simulation. Over the last 60 or so years many different techniques have been introduced to speed up the kernel calculation. Today, such a calculation requires about 50,000 integration steps for one dynamical time unit; at a speed of ∼10 GFLOP/s it would be completed in a few seconds.
The gravitational N-body problem has made enormous advances in the last decade due to algorithmic design. The introduction of digital computers in the arena (von Hoerner, 1963, Aarseth and Hoyle, 1964, van Albada, 1968) led to a relatively quick evaluation of mutual particle forces. Advanced integration techniques, introduced to turn the particle forces into predicted space–time trajectories, opened the way to predictable theoretical results (Aarseth and Lecar, 1975, Aarseth, 1999). One of the major developments in the speed-up and improved accuracy of the direct N-body problem was the introduction of the block time-step algorithm (Makino, 1991, McMillan and Aarseth, 1993).
In the late 1980s it became quite clear that the advances of modern computer technology via Moore’s law (Moore, 1965) would be insufficient to simulate large star clusters by the new decade (Makino and Hut, 1988, Makino and Hut, 1990). This realization motivated initiatives to develop special-purpose hardware for evaluating the forces between particles (Applegate et al., 1986, Taiji et al., 1996, Makino and Taiji, 1998, Makino, 2001, Makino et al., 2003), and to make efficient use of assembler code on general-purpose hardware (Nitadori et al., 2006, Nitadori et al., 2007).
One method to improve performance is to parallelise the force evaluation (Eq. (1)) for use on a Beowulf or cluster computer (with or without dedicated hardware) (Harfst et al., 2007), on a large parallel supercomputer (Makino, 2002, Dorband et al., 2003) or on a computational grid (Gualandris et al., 2007). In particular for distributed hardware it is crucial to implement an algorithm that limits communication as much as possible, otherwise the bottleneck simply shifts from the force evaluation to interprocessor communication.
A breakthrough in direct-summation N-body simulations came in the late 1990s with the development of the GRAPE series of special-purpose computers (Makino and Taiji, 1998), which achieve spectacular speedups by implementing the entire force calculation in hardware and placing many force pipelines on a single chip. The latest special purpose computer for gravitational N-body simulations, GRAPE-6, performs at a peak speed of about 64 TFLOP/s (Makino, 2001).
In our standard setup, a GRAPE processor board is attached to a host workstation, in much the same way that a floating-point or graphics accelerator card is used. We use a smaller version of the GRAPE-6: the GRAPE-6Af, which has four chips and is connected to a personal workstation via the PCI bus, delivering a theoretical peak performance of ∼131 GFLOP/s for systems of up to 128k particles at a cost of ∼$6K (Fukushige et al., 2005). Advancement of particle positions is carried out on the host computer, while interparticle forces are computed on the GRAPE.
The latest development in this endeavour is the design and construction of the GRAPE-DR, the special-purpose computer which will break the PFLOP/s barrier by the summer of 2008 (Makino, 2007). One of the main arguments for developing such a high-powered and relatively versatile computer is to perform simulations of entire galaxies (Makino, 2005a, Hoekstra et al., 2007).
The main disadvantages of these special-purpose computers, however, are the relatively short mean time between failures, the limited availability, the limited applicability, the limited on-board memory for storing particles, the simple fact that they are essentially built by a single research team led by Prof. J. Makino, and the lack of competing architectures.
The gaming industry, though not deliberately supportive of scientific research, has been developing high-powered parallel vector processors for specific rendering applications, in particular for boosting the frame rate of games. Over the last 7 years graphics processing units (GPUs) have evolved from fixed-function hardware supporting primitive graphical operations to programmable processors that outperform conventional CPUs, in particular for vectorizable parallel operations. Regrettably, the precision of these processors is still 32-bit IEEE floating point, which is below that of the average general-purpose processor, but for many applications it turns out that higher (double) precision is not crucial or can be emulated at some cost. It is because of these developments that more and more people use the GPU for wider purposes than just graphics (Fernando, 2004, Pharr and Fernando, 2005, Buck et al., 2004). This type of programming is also called general-purpose computing on graphics processing units (GPGPU). Earlier attempts to use a GPU for gravitational N-body simulations were carried out using approximate force evaluation methods with shared time steps (Nyland et al., 2004), but provided little improvement in performance. A 25-fold speed increase compared to an Intel Pentium IV processor was reported by Elsen et al. (2006), but details of their implementation of the force evaluation algorithm are as yet unclear. Recently, Hamada and Iitaka (2007) proposed the ‘Chamomile’ scheme for running N-body simulations with a shared time-step algorithm on GPUs. Though their method, which uses the CUDA programming environment, outperforms our implementation, the shared time step renders their code impractical for simulating dense star clusters.
Using GPUs as general-purpose vector processors works as follows. Colours in computer graphics are represented by one or more numbers. The luminance can be represented by just a single number, whereas a coloured pixel may contain separate values indicating the amount of red, green and blue. A fourth value, alpha, may be included to indicate the amount of transparency. Using this information, a pixel can be drawn. For general-purpose computing, the colour information of a pixel is used to represent attributes of the computation. There are many pixels in a frame and, ideally, these should all be updated at the same time and at a rate exceeding the response time of the human eye. This requires fast computations for updating the pixels when, for example, a camera moves or a new object comes into view: such operations usually affect many or even all pixels. As the majority of pixels do not require information from other pixels, processing can be done efficiently in parallel. All information required to build a pixel goes through a series of similar operations, a technique better known as single instruction, multiple data (SIMD). There are many different kinds of operations this information needs to go through. The stream programming model has been designed to make the information pass through these operations efficiently, while exposing as much parallelism as possible. The stream programming model views all information as “streams” of ordered data of the same data type. The streams pass through “kernels” that operate on them and produce one or more streams as output.
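As an illustration, the following minimal Cg fragment program sketches such a kernel; the texture name and the operation performed are purely illustrative assumptions and not part of our production code. Each fragment reads one element of the input stream from a texture and writes the result of an element-wise operation as its output colour:

    // Minimal Cg "kernel" sketch (illustrative only): each fragment fetches
    // one element of the input stream, stored in a rectangle texture, and
    // writes the squared value to the output stream (the render target).
    float4 square(float2 coord : TEXCOORD0,
                  uniform samplerRECT input_stream) : COLOR
    {
        float4 x = texRECT(input_stream, coord);  // fetch one stream element
        return x * x;                             // element-wise kernel operation
    }

Rendering a screen-filling quad with this program applies the kernel to every element of the stream in parallel, which is exactly the stream-through-kernel pattern described above.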
In this paper we report on our endeavour to convert a high-precision production-quality N-body code to operate with graphics processing units. In Section 2 we explain the adopted N-body integration algorithm, and in Section 3 we address the programming environment we used to program the GPU. In Sections 4 and 5 we present the results on two GPUs, compare them with the GRAPE-6Af, and discuss a model to explain the GPU’s performance. In Section 6 we summarize our findings, and in the Supplementary material we present a snippet of the source code in Cg.
Section snippets
Calculating the force and integrating the particles
The gravitational evolution of a system consisting of N stars with masses m_j and at positions r_j is computed by direct summation of the Newtonian force between each of the N stars. The force F_i acting on particle i is then obtained by summation over all other N − 1 particles,

\mathbf{F}_i = G m_i \sum_{j=1,\, j \neq i}^{N} m_j\, \frac{\mathbf{r}_j - \mathbf{r}_i}{|\mathbf{r}_j - \mathbf{r}_i|^3}. \qquad (1)

Here G is the Newton constant. For further readability we omit the particle index i and present vectors in boldface.
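To make the mapping of Eq. (1) onto the GPU concrete, the fragment program below is a minimal Cg sketch of the summand of Eq. (1) for a single pair (i, j), written in N-body units with G = 1; the names pos_i and pos_j and the softening constant EPS2 are illustrative assumptions, and this is not the production kernel given in the Supplementary material:

    // Illustrative Cg sketch: contribution of particle j to the acceleration
    // of particle i, i.e. the summand of Eq. (1) divided by m_i (G = 1).
    // pos_i stores (x, y, z, m) of the i-particles in a texture, one per
    // fragment; pos_j broadcasts the current j-particle as a uniform.
    // EPS2 is a tiny softening used only to guard the j = i fragment;
    // with EPS2 = 0 and j != i this is the exact Newtonian term.
    float4 pair_accel(float2 coord : TEXCOORD0,
                      uniform samplerRECT pos_i,
                      uniform float4      pos_j,
                      uniform float       EPS2) : COLOR
    {
        float4 pi    = texRECT(pos_i, coord);        // position and mass of i
        float3 dr    = pos_j.xyz - pi.xyz;           // r_j - r_i
        float  r2    = dot(dr, dr) + EPS2;
        float  rinv  = rsqrt(r2);                    // 1/r
        float  rinv3 = rinv * rinv * rinv;           // 1/r^3
        return float4(pos_j.w * rinv3 * dr, 0.0);    // m_j (r_j - r_i)/r^3
    }

The partial accelerations for successive j-particles can then be accumulated, for example by additive blending into the render target or by looping over j inside the kernel; multiplying the accumulated sum by G m_i recovers the force of Eq. (1).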
A cluster consisting of N stars evolves dynamically due to the mutual
The programming environment
The part of the algorithm that executes on the GPU (the force evaluation) is implemented in the Cg programming language (C for graphics, Fernando and Kilgard (2003), see Supplementary material), which has a syntax quite similar to C. The Cg programming environment includes a compiler and run-time libraries for use with the open graphics library (OpenGL) and DirectX graphics application programming interfaces. Though originally
Results
To test the various implementations of the force evaluator we perform several tests on different hardware. For clarity we perform each test with the same realization of the initial conditions. For this we vary the number of stars from N = 256 in factors of two up to about half a million (see Table 2). Not every set of initial conditions is run on every processor, as the Intel Xeons, for example, would take a long time and the scaling with N is unlikely to change as we clearly have reached
Performance modelling of the GPU
In modelling the performance of the GPU we adopt the model proposed by Makino (2002) and Harfst et al. (2007), but tailored to the host-plus-GPU and to the GRAPE architecture.
The wall-clock time required for advancing the n_{block} particles in a single block time step in the N-body system is

t = t_{host} + t_{force} + t_{comm}.

Here t_{host} = t_{pred} + t_{corr} is the time spent on the host computer for predicting and correcting the particles in the block, t_{force} is the time spent on the attached processor and t_{comm} is the time spent on communication between the host and the attached processor.
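A minimal sketch of how the individual terms in such a model are commonly parametrized (illustrative scalings in the spirit of Makino (2002); the coefficients \tau are placeholders, not the fitted values of this section) is

t_{host} \simeq (\tau_{pred} + \tau_{corr})\, n_{block}, \qquad t_{force} \simeq \tau_{pair}\, n_{block}\, N, \qquad t_{comm} \simeq \tau_{lat} + \tau_{tr}\, n_{block},

where \tau_{pred}, \tau_{corr} and \tau_{pair} are per-particle (or per-pair) computation times, \tau_{lat} is a fixed latency per call and \tau_{tr} the transfer time per particle; such machine-dependent constants are obtained by fitting to measured timings.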
Discussion
We have successfully implemented the direct gravitational force evaluation using Cg on two graphics cards, the NVIDIA Quadro FX1400 and the NVIDIA GeForce 8800GTX, and compared their performance with that of the host workstation and the GRAPE-6Af special purpose computer.
For N ≲ 10⁴ particles the workstation outperforms the GPUs. This is mainly due to additional overhead introduced by the communication with the GPU and memory allocation on the GPU. For a larger number of particles the more
Acknowledgments
We are grateful to Mark Harris and David Luebke of NVIDIA for supplying us with the two NVIDIA GeForce 8800GTX graphics cards on which part of the simulations were performed. We also thank Jeroen Bédorf, Derek Groen, Alessia Gualandris and Jun Makino for numerous discussions, and the referee Piet Hut for pointing us to the importance of discussing the accuracy of the GPU. This work was supported by NWO (via Grants #635.000.303 and #643.200.503) and the Netherlands Advanced School for