GenASiSBasics provides Fortran 2003 classes furnishing extensible object-oriented utilitarian functionality for large-scale physics simulations on distributed memory supercomputers. This functionality includes physical units and constants; display to the screen or standard output device; message passing; I/O to disk; and runtime parameter management and usage statistics. This revision – Version 3 of Basics – includes a significant name change, some minor additions to functionality, and a major addition to functionality: infrastructure facilitating the offloading of computational kernels to devices such as GPUs.
New version program summary
Program Title: SineWaveAdvection, SawtoothWaveAdvection, and RiemannProblem (fluid dynamics example problems illustrating GenASiSBasics); ArgonEquilibrium and ClusterFormation (molecular dynamics example problems illustrating GenASiSBasics)
Does the new version supersede the previous version?: Yes
Reasons for the new version: This version includes a significant name change, some minor additions to functionality, and a major addition to functionality: infrastructure facilitating the offloading of computational kernels to devices such as GPUs.
Summary of revisions:
The class VariableGroupForm – a major workhorse for handling sets of related fields – has been renamed StorageForm.
The ability to use unicode characters in standard output has been added, but is currently only supported by the GNU Compiler Collection (GCC). This capability is used to display exponents as numerical superscripts, as well as unit symbols such as Å in the display of relevant units. It is made operational by a compiler-flag line now included in the machine-specific makefile fragments with a GCC suffix in the Build/Machines directory.
There are some changes to units and constants. The geometrized units of past releases (G = c = 1, with the meter as the fundamental unit) have been replaced by natural units (ℏ = c = 1, with MeV as the fundamental unit). Lorentz–Heaviside electromagnetic units are employed (vacuum permeability μ₀ = 1; no factors of 4π in the Maxwell equations). This refers to numbers as processed internally by the code; as described in the initial release, users can employ the members of the UNIT singleton for input/output purposes, that is, to specify or display numbers with any available units they wish. A number of units have been added, and the specification of all units has been put on a more rational basis in keeping with six of the seven standard SI base units (meter, kilogram, second, ampere, kelvin, mole; we have not needed the candela; see [3]). Some physical and astrophysical constants have also been added. All constants have been updated to 2018 values [4].
For notifications to standard output, a few tweaks to ignorability levels have been made in various classes. The default output to screen is now less verbose (ignorability INFO_1, our designation for messages of significance just below WARNING).
A couple of additions have been made to MessagePassing: null subcommunicators are accommodated, and an AllToAll_V method has been added to CollectiveOperation_R_Form.
Enhancements to timer functionality have been made. The class TimerForm now has a member Level, which is specified in order to control indentation in screen output. Some functionality has been added to PROGRAM_HEADER_Singleton to work with timers. A method TimerPointer returns a pointer to a timer with a specified Handle (typically a meaningfully named integer). The new members TimerLevel and TimerDisplayFraction of PROGRAM_HEADER_Singleton, which can be set from the command line, can be used to suppress output from timings deemed insignificant, based on timer level or a measured time interval falling below a specified fraction of the total execution time.
The most significant addition to functionality in this release is infrastructure to offload computational kernels to hardware accelerators such as GPUs using the device-related directives and runtime library routines in OpenMP 4.5 and later. This infrastructure, implemented in a new subdivision Devices (see Fig. 1), provides lower-level routines for memory management between the host (CPU) and device (GPU), including data allocation, data movement between host and device, and device-to-host memory address association. The routines are implemented as Fortran wrappers to the OpenMP runtime library and to CUDA routines written in C. Additional methods and an option utilizing the lower-level Devices routines have been added to our StorageForm class. They are: UpdateHost() and UpdateDevice(), to copy data from device to host and from host to device, respectively; AllocateDevice(), to allocate memory on the device mirroring the allocation on the host; and PinnedOption, an optional flag to the Initialize() method that allocates the host memory in a page-locked region to facilitate faster data transfer between host and device. A detailed description of the implementation of this functionality can be found in [5].
To deal with different levels of compiler support for device-related OpenMP directives, we use the preprocessor in some source files in Devices to guard against attempted compilation of unsupported features. Preprocessor macro substitution is also utilized in OpenMP directives to switch between multi-threading parallelism on CPUs and offload parallelism to GPUs. Setting the makefile variable ENABLE_OMP_OFFLOAD to 1 – which is the default in the machine-specific makefile Makefile_POWER_XL for the XL compiler on POWER-based supercomputers – sets the appropriate flags and preprocessing to enable compilation for OpenMP offload parallelism. Alternatively, this variable can be set on the command line when make is invoked.
Information regarding the number of devices available to the program, the kind of OpenMP parallelism enabled (i.e. multi-threading or offload), and the selected OpenMP loop schedule is displayed at runtime by PROGRAM_HEADER_Singleton. When offload parallelism is enabled, the loop schedule is automatically set to static with a chunk size of 1. With multi-threading parallelism, the schedule defaults to guided but can be overridden at runtime by setting the environment variable OMP_SCHEDULE appropriately.
The example problem RiemannProblem in the Examples directory under the Basics division has been modified to exploit GPUs using this new functionality. The computational kernels for the problem have been annotated with new OpenMP directives (via the appropriate preprocessor macros) such that they are offloaded to the GPUs when offload parallelism is enabled during compilation. In [5] we demonstrate the weak scaling of this example problem up to 8000 GPUs on the Summit supercomputer at the Oak Ridge Leadership Computing Facility. Figs. 2 and 3 show a visualization of the three-dimensional version of RiemannProblem executed with GPUs.
Nature of problem: By way of illustrating GenASiSBasics functionality, solve example fluid dynamics and molecular dynamics problems.
Solution method: For fluid dynamics examples, finite-volume. For molecular dynamics examples, leapfrog and velocity-Verlet integration.
Additional comments including restrictions and unusual features: The example problems named above are not ends in themselves, but serve to illustrate our object-oriented approach and the functionality available through GenASiSBasics. In addition to these more substantial examples, we provide individual unit test programs for each of the classes that make up GenASiSBasics.
M. Tanabashi et al. (Particle Data Group), Phys. Rev. D 98 (2018) 030001
R.D. Budiardja, C.Y. Cardall, Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience, submitted for publication in Parallel Computing: Systems and Applications; arXiv:1812.07977 [physics.comp-ph].
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract number DE-AC05-00OR22725 and the National Science Foundation under Grant No. 1535130. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).