High-performance direct gravitational N-body simulations on graphics processing units
Introduction
Since the first large-scale simulations of self-gravitating systems, the direct N-body method has gained a solid footing in the research community. At the moment N-body techniques are used in astronomical studies of planetary systems, debris discs, stellar clusters and galaxies, all the way to simulations of the entire universe (Hut, 2007). Outside astronomy, the main areas of research which utilize the same techniques are molecular dynamics, elementary particle scattering simulations, plate tectonics, traffic simulations and chemical reaction network studies. In these non-astronomical applications the main force-evaluation routine is not as computationally demanding as in gravitational N-body simulations, but the backbone simulation environments are not very different.
The main difficulty in simulating self-gravitating systems is the lack of antigravity, which results in the requirement of global communication: each object feels the gravitational attraction of every other object.
The first astronomical simulation of a self-gravitating N-body system was carried out by Holmberg (1941), who used 37 light bulbs and photoelectric cells to evaluate the forces on the individual objects. Holmberg spent weeks performing this quite moderate 37-particle simulation. Over the last 60 or so years many different techniques have been introduced to speed up the kernel calculation. Today, such a calculation requires about 50,000 integration steps for one dynamical time unit; at a speed of ∼10 GFLOP/s it would be completed in a few seconds.
The gravitational N-body problem has made enormous advances in the last decade due to algorithmic design. The introduction of digital computers in the arena (von Hoerner, 1963, Aarseth and Hoyle, 1964, van Albada, 1968) led to a relatively quick evaluation of mutual particle forces. Advanced integration techniques, introduced to turn the particle forces into predicted space–time trajectories, opened the way to predictable theoretical results (Aarseth and Lecar, 1975, Aarseth, 1999). One of the major developments in the speed-up and improved accuracy of the direct N-body problem was the introduction of the block time-step algorithm (Makino, 1991, McMillan and Aarseth, 1993).
In the late 1980s it became quite clear that the advances of modern computer technology via Moore’s law (Moore, 1965) would be insufficient to simulate large star clusters by the new decade (Makino and Hut, 1988, Makino and Hut, 1990). This realization motivated initiatives to develop special-purpose hardware for evaluating the forces between particles (Applegate et al., 1986, Taiji et al., 1996, Makino and Taiji, 1998, Makino, 2001, Makino et al., 2003), and to make efficient use of assembler code on general-purpose hardware (Nitadori et al., 2006, Nitadori et al., 2007).
One method to improve performance is to parallelise the force evaluation (Eq. (1)) for use on a Beowulf or cluster computer (with or without dedicated hardware) (Harfst et al., 2007), on a large parallel supercomputer (Makino, 2002, Dorband et al., 2003) or on a computational grid (Gualandris et al., 2007). In particular for distributed hardware it is crucial to implement an algorithm that limits communication as much as possible, otherwise the bottleneck simply shifts from the force evaluation to interprocessor communication.
A breakthrough in direct-summation N-body simulations came in the late 1990s with the development of the GRAPE series of special-purpose computers (Makino and Taiji, 1998), which achieve spectacular speedups by implementing the entire force calculation in hardware and placing many force pipelines on a single chip. The latest special purpose computer for gravitational N-body simulations, GRAPE-6, performs at a peak speed of about 64 TFLOP/s (Makino, 2001).
In our standard setup, a GRAPE processor board is attached to a host workstation, in much the same way that a floating-point or graphics accelerator card is used. We use a smaller version of the GRAPE-6: the GRAPE-6Af, which has four chips and is connected to a personal workstation via the PCI bus, delivering a theoretical peak performance of ∼131 GFLOP/s for systems of up to 128k particles at a cost of ∼$6K (Fukushige et al., 2005). Advancement of particle positions is carried out on the host computer, while interparticle forces are computed on the GRAPE.
The latest development in this endeavour is the design and construction of the GRAPE-DR, the special-purpose computer which will break the PFLOP/s barrier by the summer of 2008 (Makino, 2007). One of the main arguments for developing such a high-powered and relatively versatile computer is to perform simulations of entire galaxies (Makino, 2005a, Hoekstra et al., 2007).
The main disadvantages of these special-purpose computers, however, are the relatively short mean time between failures, the limited availability, the limited applicability, the limited on-board memory for storing particles, the simple fact that they are essentially built by a single research team led by Prof. J. Makino, and the lack of competing architectures.
The gaming industry, though not deliberately supportive of scientific research, has been developing high-powered parallel vector processors for specific rendering applications, in particular for boosting the frame rate of games. Over the last 7 years graphics processing units (GPUs) have evolved from fixed-function hardware supporting primitive graphical operations to programmable processors that outperform conventional CPUs, in particular for vectorizable parallel operations. Regrettably, the precision of these processors is still 32-bit IEEE floating point, which is below that of the average general-purpose processor, but for many applications it turns out that higher (double) precision is not crucial or can be emulated at some cost. It is because of these developments that more and more people use the GPU for wider purposes than just graphics (Fernando, 2004, Pharr and Fernando, 2005, Buck et al., 2004). This type of programming is also called general-purpose computing on graphics processing units (GPGPU). Earlier attempts to use a GPU for gravitational N-body simulations were carried out using approximate force evaluation methods with shared time steps (Nyland et al., 2004), but provided little improvement in performance. A 25-fold speed increase compared to an Intel Pentium IV processor was reported by Elsen et al. (2006), but details of their implementation of the force evaluation algorithm are as yet unclear. Recently, Hamada and Iitaka (2007) proposed the ‘Chamomile’ scheme for running N-body simulations with a shared time-step algorithm on GPUs. Though their method, which uses the CUDA programming environment, outperforms our implementation, the shared time step renders their code impractical for simulating dense star clusters.
Using GPUs as general-purpose vector processors works as follows. Colours in computer graphics are represented by one or more numbers. The luminance can be represented by just a single number, whereas a coloured pixel may contain separate values indicating the amount of red, green and blue. A fourth value, alpha, may be included to indicate the amount of transparency. Using this information, a pixel can be drawn. For general-purpose computing, the colour information of a pixel is used to represent attributes of the computation. There are many pixels in a frame and, ideally, these should all be updated at the same time and at a rate exceeding the response time of the human eye. This requires fast computations for updating the pixels when, for example, a camera moves or a new object comes into view: such operations usually affect many or even all pixels. As the majority of pixels do not require information from other pixels, processing can be done efficiently in parallel. All information required to build a pixel goes through a series of similar operations, a technique better known as single instruction, multiple data (SIMD). There are many different kinds of operations this information needs to go through. The stream programming model has been designed to make the information pass through these operations efficiently, while exposing as much parallelism as possible. The stream programming model views all information as “streams” of ordered data of the same data type. The streams pass through “kernels” that operate on them and produce one or more streams as output.
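As an illustration, the following minimal Cg fragment program sketches such a kernel; the texture name and the operation performed are purely illustrative assumptions and not part of our production code. Each fragment reads one element of the input stream from a texture and writes the result of an element-wise operation as its output colour:

    // Minimal Cg "kernel" sketch (illustrative only): each fragment fetches
    // one element of the input stream, stored in a rectangle texture, and
    // writes the squared value to the output stream (the render target).
    float4 square(float2 coord : TEXCOORD0,
                  uniform samplerRECT input_stream) : COLOR
    {
        float4 x = texRECT(input_stream, coord);  // fetch one stream element
        return x * x;                             // element-wise kernel operation
    }

Rendering a screen-filling quad with this program applies the kernel to every element of the stream in parallel, which is exactly the stream-through-kernel pattern described above.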
In this paper we report on our endeavour to convert a high-precision production-quality N-body code to operate with graphics processing units. In Section 2 we explain the adopted N-body integration algorithm, and in Section 3 we address the programming environment we used to program the GPU. In Sections 4 and 5 we present the results on two GPUs, compare them with the GRAPE-6Af, and discuss a model to explain the GPU’s performance. In Section 6 we summarize our findings, and in the Supplementary material we present a snippet of the source code in Cg.
Section snippets
Calculating the force and integrating the particles
The gravitational evolution of a system consisting of N stars with masses m_j and at positions r_j is computed by direct summation of the Newtonian force between each of the N stars. The force F_i acting on particle i is then obtained by summation over all other N − 1 particles,

\mathbf{F}_i = G m_i \sum_{j=1,\, j \neq i}^{N} m_j\, \frac{\mathbf{r}_j - \mathbf{r}_i}{|\mathbf{r}_j - \mathbf{r}_i|^3}. \qquad (1)

Here G is the Newton constant. For further readability we omit the particle index i and present vectors in boldface.
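To make the mapping of Eq. (1) onto the GPU concrete, the fragment program below is a minimal Cg sketch of the summand of Eq. (1) for a single pair (i, j), written in N-body units with G = 1; the names pos_i and pos_j and the softening constant EPS2 are illustrative assumptions, and this is not the production kernel given in the Supplementary material:

    // Illustrative Cg sketch: contribution of particle j to the acceleration
    // of particle i, i.e. the summand of Eq. (1) divided by m_i (G = 1).
    // pos_i stores (x, y, z, m) of the i-particles in a texture, one per
    // fragment; pos_j broadcasts the current j-particle as a uniform.
    // EPS2 is a tiny softening used only to guard the j = i fragment;
    // with EPS2 = 0 and j != i this is the exact Newtonian term.
    float4 pair_accel(float2 coord : TEXCOORD0,
                      uniform samplerRECT pos_i,
                      uniform float4      pos_j,
                      uniform float       EPS2) : COLOR
    {
        float4 pi    = texRECT(pos_i, coord);        // position and mass of i
        float3 dr    = pos_j.xyz - pi.xyz;           // r_j - r_i
        float  r2    = dot(dr, dr) + EPS2;
        float  rinv  = rsqrt(r2);                    // 1/r
        float  rinv3 = rinv * rinv * rinv;           // 1/r^3
        return float4(pos_j.w * rinv3 * dr, 0.0);    // m_j (r_j - r_i)/r^3
    }

The partial accelerations for successive j-particles can then be accumulated, for example by additive blending into the render target or by looping over j inside the kernel; multiplying the accumulated sum by G m_i recovers the force of Eq. (1).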
A cluster consisting of N stars evolves dynamically due to the mutual
The programming environment
The part of the algorithm that executes on the GPU (the force evaluation) is implemented in the Cg programming language (C for graphics, Fernando and Kilgard (2003), see Supplementary material), which has a syntax quite similar to C. The Cg programming environment includes a compiler and run-time libraries for use with the open graphics library (OpenGL) and DirectX graphics application programming interfaces. Though originally
Results
To test the various implementations of the force evaluator we perform several tests on different hardware. For clarity we perform each test with the same realization of the initial conditions. For this we vary the number of stars from N = 256 in factors of two up to about half a million (see Table 2). Not every set of initial conditions is run on every processor, as the Intel Xeons, for example, would take a long time and the scaling with N is unlikely to change as we clearly have reached
Performance modelling of the GPU
In modelling the performance of the GPU we adopt the model proposed by Makino (2002) and Harfst et al. (2007), but tailored to the host-plus-GPU and to the GRAPE architecture.
The wall-clock time required for advancing the n_{block} particles in a single block time step in the N-body system is

t = t_{host} + t_{force} + t_{comm}.

Here t_{host} = t_{pred} + t_{corr} is the time spent on the host computer for predicting and correcting the particles in the block, t_{force} is the time spent on the attached processor and t_{comm} is the time spent on communication between the host and the attached processor.
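A minimal sketch of how the individual terms in such a model are commonly parametrized (illustrative scalings in the spirit of Makino (2002); the coefficients \tau are placeholders, not the fitted values of this section) is

t_{host} \simeq (\tau_{pred} + \tau_{corr})\, n_{block}, \qquad t_{force} \simeq \tau_{pair}\, n_{block}\, N, \qquad t_{comm} \simeq \tau_{lat} + \tau_{tr}\, n_{block},

where \tau_{pred}, \tau_{corr} and \tau_{pair} are per-particle (or per-pair) computation times, \tau_{lat} is a fixed latency per call and \tau_{tr} the transfer time per particle; such machine-dependent constants are obtained by fitting to measured timings.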
Discussion
We have successfully implemented the direct gravitational force evaluation using Cg on two graphics cards, the NVIDIA Quadro FX1400 and the NVIDIA GeForce 8800GTX, and compared their performance with that of the host workstation and the GRAPE-6Af special purpose computer.
For N ≲ 10⁴ particles the workstation outperforms the GPUs. This is mainly due to additional overhead introduced by the communication with the GPU and memory allocation on the GPU. For a larger number of particles the more
Acknowledgments
We are grateful to Mark Harris and David Luebke of NVIDIA for supplying us with the two NVIDIA GeForce 8800GTX graphics cards on which part of the simulations were performed. We also thank Jeroen Bédorf, Derek Groen, Alessia Gualandris and Jun Makino for numerous discussions, and the referee Piet Hut for pointing us to the importance of discussing the accuracy of the GPU. This work was supported by NWO (via Grants #635.000.303 and #643.200.503) and the Netherlands Advanced School for