1 Introduction

Collecting performance data to examine the run-time behavior of a program is essential for identifying regions in the code that benefit most from optimization or parallelization [14]. Traditionally, this data is collected using either sampling or instrumentation techniques. For use cases that require a more in-depth analysis, such as the creation of performance models [4, 5] for specific functions, accurate measurements are essential. Hence, instrumentation is better suited, as it guarantees that every function invocation is recorded accurately.

However, instrumenting all functions in a program typically generates a large overhead, which can increase the execution time by orders of magnitude [18]. This is in large part caused by frequently-called, short-running functions. Additionally, the insertion of measurement hooks can prohibit optimization in some cases [25].

For this reason, a filtering approach is typically necessary to instrument only those functions which are most relevant w.r.t. a user-defined metric, e.g. execution time. Excluding all other functions reduces the total number of calls to the measurement tool and, thus, the execution overhead. We refer to the set of instrumented functions as the instrumentation configuration (IC).

The simplest way to create a suitable IC is to define filter lists manually. The typical workflow involves first profiling a fully-instrumented version of the code. Subsequently, the user examines the resulting profile and selects the functions that should be excluded from the measurement. The drawback of this approach is that the user has to select the functions to instrument by hand, which may require multiple iterations of compiling the code, executing it to generate a profile, and refining the IC. Hence, different tools to automate the selection process have been proposed; they mainly differ in whether they use runtime data or rely on source-code features to determine a suitable IC. Unfortunately, the application of current compiler-assisted static selection tools is tedious and error-prone, despite their general advantages in expressiveness and overhead reduction.

In this paper, we focus on the composable instrumentation selection mechanism introduced by the InstRO [10] project. In the context of the exaFOAM project, we investigated its applicability for the instrumentation of the computational fluid dynamics (CFD) framework OpenFOAM [26]. However, due to the scale and structure of OpenFOAM, we found that the current implementation of InstRO is not suited to this task.

We present the Compiler-assisted Performance Instrumentation (CaPI) tool, which adopts ideas from InstRO and makes them applicable for the selective instrumentation of large-scale codes. We make the following contributions: (1) Present a new instrumentation tool based on key principles of InstRO. (2) Demonstrate its application on large-scale scientific software and identify specific usability and validation impediments. (3) Identify key challenges for improving CaPI specifically, as well as compiler-assisted selection tools in general.

The paper is structured as follows: Sect. 2 gives an overview of related work. Section 3 explains particularities of OpenFOAM and how they stress limitations of InstRO. Section 4 presents the CaPI toolchain to address these limitations. Thereafter, CaPI is evaluated on OpenFOAM in Sect. 5. Usability and validation impediments are highlighted in Sect. 6. The results are subsequently discussed in Sect. 7. Finally, Sect. 8 summarizes the paper and gives a brief outline on how remaining challenges may be addressed.

2 Related Work

Several tools have been developed to help automate the process of constructing ICs for performance measurements, or reduce the overhead by filtering runtime events. Their function selection methods can be divided into three categories, for which we list some representative tools.

  • Profile-guided selection uses previously recorded profile data to determine which functions to exclude or include in a subsequent measurement. An example is the scorep-score utility of the Score-P measurement infrastructure [12]. It enables the user to define a set of threshold values for, e.g., execution time per invocation, which need to be exceeded by a function to be considered for instrumentation. PerfTaint [6] applies a taint analysis to determine which parts of the application depend on a given set of input parameters, and only instruments dependent functions, as all others are considered to have constant runtime w.r.t. the set of input parameters.

  • Compiler-assisted selection tools aim to semi-automatically determine a suitable IC with the help of static code analysis. Tau [25] enables the selective instrumentation of functions via the use of its intermediate representation called PDT [19]. Cobi [21] requires the user to specify which points in a program to instrument in an XML-based format. It relies on binary instrumentation using the DynInst API [3], and, since it operates at the binary level, ignores C++ virtual functions or function pointers for any path analysis. The InstRO project [10] gives the user the ability to define selection passes that filter out functions based on statically collected information. Notably, a static call graph (CG) is generated that gives information about the call context of the respective function. This information can be used to decide if the function is relevant for overall performance.

  • Hybrid selection tools combine profile- and static data for the creation of IC files. PIRA [16] employs a static statement aggregation scheme [11] to estimate the amount of work per function for an initial IC. Subsequently, the IC is iteratively refined using profile information or empirically constructed performance models [15]. X-Ray [1] instrumentation uses instruction-level heuristics to estimate if a function should be instrumented, and, if so, inserts no-op sleds into the binary. At runtime, the sleds can be patched to enable or disable the recording of events, which may also be filtered based on their occurrence or available memory.

3 Tailored Instrumentation for OpenFOAM

While the utility of compiler-assisted selection tools has been successfully demonstrated on smaller applications, large scientific codes pose particular challenges.

OpenFOAM, a modular CFD framework, is a prime example of such a code. It comprises a multitude of individual solvers and is applicable to a wide variety of problems. OpenFOAM v2106 [22] consists of over 5000 C++ source files and ≈1.2 million lines of code (counted with cloc [7]).

Its philosophy centers around an extendable toolbox for physics simulation. Hence, OpenFOAM provides many libraries that implement different solver algorithms, preconditioners, and other utilities required to develop simulation software. These libraries are employed in various solvers for specific use cases and physical phenomena, e.g., multi-phase flows or fluid-structure interaction, requiring a high degree of flexibility and configurability in the code base. A further, very particular property of OpenFOAM is its use of the project-specific build system wmake. Build systems, particularly custom and niche ones, commonly pose challenges [8], e.g., when maintaining multiple configurations, and can significantly complicate the application of static analysis and instrumentation tools.

The following section outlines how these features of OpenFOAM make the application of the existing InstRO tool impractical.

3.1 Design and Limitations of InstRO

InstRO provides a configurable set of passes, which can be combined by the user to perform customized source-to-source code transformations on selected code regions. Passes can be divided into three categories: Selectors select code regions for instrumentation based on code features. Transformers perform necessary source code transformations, e.g., to canonicalize certain constructs for instrumentation. Finally, Adapters implement the actual instrumentation of the code. Figure 1 provides an example of how passes may be combined for selective instrumentation of functions related to MPI [20] usage.

Fig. 1. Example InstRO pass pipeline, adapted from [10]. Two selector passes select the MPI-related functions and the set of all functions, respectively. A call-path pass identifies the paths between the functions selected. A filtering pass removes functions that match either the previously selected exclude set or a name filter (functions matching a certain regular expression). Finally, an adapter pass inserts the instrumentation hooks.

This abstract pass design makes InstRO highly configurable, and, together with its whole-program analysis, a powerful instrumentation tool. Moreover, the layered design of InstRO makes many parts of the tool—theoretically at least—independent of the compiler technology used underneath. However, most of InstRO’s features have been implemented on top of the ROSE source-to-source translator. A Clang-based implementation exists, but provides, in comparison, only limited functionality.

For the application to OpenFOAM, both versions proved unsuitable. The main issue is the need for a global CG analysis in order to enable the selection of specific call-paths. In the ROSE implementation, this requires the parsing and merging of all 5000 source files at once, which is impractical due to time and memory constraints. The Clang implementation lacks global CG analysis capabilities altogether.

To overcome this obstacle, we developed the new CaPI tool, which is based on the InstRO paradigms but suitable for application to large-scale codes. We demonstrate its capabilities on OpenFOAM and construct a low-overhead IC that focuses on analyzing functions that use MPI communication.

4 The CaPI Instrumentation Toolchain

In this section, the CaPI workflow and its implementation are introduced and explained in further detail.

We reworked the InstRO toolchain in order to make it applicable to the OpenFOAM use case. Most notably, we switched from a source-to-source transformation to a more flexible compiler instrumentation approach. This necessitated moving from the abstract pass formulation to a more concrete workflow comprising analysis, selection, and instrumentation steps. CaPI employs MetaCG [17] for global CG analysis, which was developed for a similar purpose in the automatic instrumentation refinement tool PIRA [16]. We use a custom domain-specific language (DSL) to implement the user-defined selection mechanism, designed with a focus on ease-of-use and conciseness.

4.1 Instrumentation Workflow

The toolchain consists of two main phases: In the analysis and selection phase the code is analyzed statically and relevant code regions are selected for instrumentation. We employ a stand-alone selection tool to process the collected data and generate the IC. The final instrumentation step is implemented using a custom LLVM [13] optimizer plugin. During compilation, hooks are inserted into the selected functions to interface with the measurement library. These steps are illustrated in Fig. 2.

Fig. 2. Our instrumentation toolchain consists of these steps: (1) Preparation of the target code's build system, if required. (2) Generation of a compilation database for Clang-based tools. (3) Translation-unit-local CG construction using the MetaCG workflow. (4) Whole-program CG construction by manually combining the CGs of the relevant source files. (5) Definition of the selection specification. (6) Execution of the CaPI analysis to create the IC. (7) Compilation of the target code with IC instrumentation.
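
Step (2) refers to the standard compilation database (compile_commands.json) consumed by Clang-based tooling, which records one compiler invocation per translation unit. A minimal example with placeholder paths:

    [
      {
        "directory": "/home/user/OpenFOAM/build",
        "command": "clang++ -O2 -Iinclude -c icoFoam.C -o icoFoam.o",
        "file": "icoFoam.C"
      }
    ]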

4.2 Implementation

The implementation distinguishes between the selection phase, which is implemented in a stand-alone tool, and the compilation phase, in which an LLVM plugin is used to insert the instrumentation hooks. We provide a more detailed explanation of how the selection is implemented and how different selection passes are combined. Thereafter, we briefly explain the compilation phase.

Analysis and Selection. The selection is applied to the whole-program CG representation provided by MetaCG. Hence, selectors can match function names, or structural properties of functions within the CG. The whole-program view enables the selectors to maintain full context information for the functions selected, when desired.

One of the fundamental paradigms of InstRO is the composability of its selector modules. We realize this composability via a lightweight DSL. This DSL enables the user to easily instantiate a nested sequence of parameterized selectors. We found that, compared to an alternative XML or JSON based format, this approach results in a much more concise and comprehensible specification. A simplified grammar definition is shown in Fig. 3.

Fig. 3. BNF grammar of the CaPI DSL. Some nonterminals related to the parsing of literals have been omitted for brevity. The full, up-to-date grammar is available in the project repository (https://github.com/tudasc/CaPI).

A selection specification consists of a sequence of selector definitions, which may be named or anonymous. The last of these definitions serves as the entry point to the selection pipeline. Each definition starts with the name of the selector module, followed by a list of arguments enclosed in parentheses. Aside from basic data types, i.e., strings, booleans, integers, and floating-point numbers, selector modules may accept other selector definitions as input. These can be defined in-place or passed as a reference to a previously defined (named) selector instance. Such references consist of a dedicated reference marker followed by the identifier of the named selector. In addition, a pre-defined reference corresponds to the set of all functions.

Listing 1 shows an example of a call-path selection pipeline that instruments functions on paths to MPI calls.
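
As the listing is not reproduced here, the following minimal sketch illustrates the shape of such a specification; the selector names byName and onCallPathTo as well as the %-style reference syntax are assumptions based on the grammar in Fig. 3 and the project repository, not a verbatim copy of Listing 1:

    mpi = byName("MPI_.*", %%)
    onCallPathTo(%mpi)

The named selector mpi matches all MPI functions by name, starting from the pre-defined reference to the set of all functions. The final, anonymous definition selects every function on a call path to this set and, as the last definition, serves as the entry point of the pipeline.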

The user can choose from a set of predefined selectors that can be customized for the specific use case. The following selectors are currently available:

  • Include/exclude lists: Select functions by name based on regular expressions.

  • Specifier selection: Select functions w.r.t. specifiers, e.g., the inline keyword.

  • Call-path selection: Select all functions that are in the call chain below or above a previously selected function.

  • Unresolved call selection: Select functions that contain calls via function pointers, which may not be statically resolvable.

  • Set operations: Merge selection sets using basic operations such as union, intersection and complement.

The selection pipeline is applied to all functions in the CG, resulting in the final IC file. This file consists of the list of functions to be instrumented and is compatible with the Score-P filter file format. Hence, Score-P can be used as an alternative to our compiler plugin for the instrumentation step.
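
For illustration, an IC expressed in the Score-P filter file format might look as follows; the region names are placeholders, and the initial exclude-all rule turns the file into a pure include list of the kind CaPI generates:

    SCOREP_REGION_NAMES_BEGIN
      EXCLUDE *
      INCLUDE MPI_*
      INCLUDE Foam::fvMatrix*
    SCOREP_REGION_NAMES_END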

Compilation. We use the Clang/LLVM compiler toolchain to build the target code and perform the instrumentation. A custom LLVM plugin reads the IC file and identifies all functions in the current translation unit that are contained in the IC. These functions are then marked with LLVM function instrumentation attributes. Subsequently, the instrumentation attributes are consumed by the existing post-inline LLVM pass and the measurement hooks are inserted accordingly. We apply the instrumentation after inlining in order to pre-emptively reduce instrumentation overhead. The enter and exit hooks conform to the GNU profiling interface, which GCC-compatible compilers use for function instrumentation via the -finstrument-functions flag [9].
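
The following sketch outlines the attribute-marking step; the surrounding plugin boilerplate and the set ic holding the IC entries are assumptions, while the attribute names are those consumed by LLVM's post-inline EntryExitInstrumenter pass:

    #include <set>
    #include <string>
    #include "llvm/IR/Module.h"

    // Mark every IC function in the module for post-inline instrumentation.
    // `ic` is assumed to hold the function names read from the IC file.
    void markICFunctions(llvm::Module &M, const std::set<std::string> &ic) {
      for (llvm::Function &F : M) {
        if (F.isDeclaration() || !ic.count(F.getName().str()))
          continue;
        // Consumed after inlining; triggers insertion of the profiling hooks.
        F.addFnAttr("instrument-function-entry-inlined", "__cyg_profile_func_enter");
        F.addFnAttr("instrument-function-exit-inlined", "__cyg_profile_func_exit");
      }
    }

On the measurement side, any library that implements the two GNU profiling hooks can consume the generated events. A self-contained toy consumer, useful for verifying that the instrumentation fires:

    #include <cstdio>

    // Toy measurement library for the GNU profiling interface. The attribute
    // prevents the hooks from being instrumented themselves.
    extern "C" {
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *callSite) {
      std::fprintf(stderr, "enter %p (from %p)\n", fn, callSite);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *callSite) {
      std::fprintf(stderr, "exit  %p\n", fn);
    }
    }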

4.3 Score-P Integration

In principle, CaPI is compatible with any measurement tool that supports the GNU profiling interface. Our main target, however, is the Score-P measurement infrastructure, which is commonly available in HPC environments. While Score-P supports the GNU profiling interface in addition to defining its own measurement API, the GNU version is limited to recording only statically linked functions. This is because only symbols with statically known addresses are collected from the main executable. As a result, the function names corresponding to calls into shared libraries cannot be identified and are thus ignored in the measurement.

We have developed the Score-P symbol injector library to identify and register these missing symbols [24]. Linked into the instrumented executable, it queries the /proc/self/maps pseudo-file at start-up to obtain information about the memory mapping of the loaded shared libraries. Each of these libraries is then analyzed with nm. Using the previously-collected information, each symbol is mapped to its address in the running program. Functions that are included in the IC are then registered in Score-P’s internal address-resolution hash map.
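
A simplified sketch of the start-up step is shown below; it only demonstrates how the base address of each executable mapping can be obtained, whereas the actual library additionally parses the nm output and registers the matching IC symbols with Score-P:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Print the base address and path of every executable file-backed
    // mapping of the running process, as read from /proc/self/maps.
    int main() {
      std::ifstream maps("/proc/self/maps");
      std::string line;
      while (std::getline(maps, line)) {
        std::istringstream fields(line);
        std::string range, perms, offset, dev, inode, path;
        fields >> range >> perms >> offset >> dev >> inode >> path;
        if (perms.find('x') == std::string::npos || path.empty())
          continue;  // keep only executable mappings backed by a file
        // For shared libraries, nm reports symbol values relative to this base.
        std::uint64_t base =
            std::stoull(range.substr(0, range.find('-')), nullptr, 16);
        std::cout << std::hex << base << ' ' << path << '\n';
      }
    }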

Listing 1. Selection specification for instrumenting call paths to MPI communication.

5 Evaluation on OpenFOAM

In this section, we demonstrate the presented CaPI toolchain on OpenFOAM and examine the obtained measurement results.

We evaluated the ICs with two OpenFOAM test cases: 3-D lid-driven cavity (cavity), a well-known benchmark problem for incompressible flow [2], and HPC_Motorbike (motorbike), a simulation of the flow around a motorbike model [23]. The executables applied in the main solve phase are icoFoam and simpleFoam, respectively. We measured the execution time for the Score-P profiling mode on a single node of the Lichtenberg 2 cluster, running with 4 MPI processes.

The compatibility of CaPI with the Score-P filter file format enables the comparison of various combinations of the available selection and instrumentation methods. This is illustrated in Fig. 4.

Fig. 4. Interoperability of Score-P and CaPI selection and instrumentation methods. The IC generated by CaPI or scorep-score can be combined with CaPI's Clang-based instrumenter or the GCC-based Score-P instrumenter. Note that using the GNU interface requires linking the symbol injector library to record calls to shared libraries.

The full specification of the evaluated variants is shown in Table 1. All instrumented variants rely on Score-P's compile-time filtering method, using an IC generated by either scorep-score or CaPI. The scorep-full variant corresponds to Score-P's default full instrumentation, which does not perform any explicit filtering but excludes all functions declared as inline. The hybrid variant combines both selection methods by performing additional runtime filtering. All variants were compiled with -O2 optimization.

Table 1. Build configurations used in the evaluation.

For the scorep-score IC, we filtered out all functions that are called at least a million times and take less than 10 µs to execute. This yielded filter files that exclude 17 functions for cavity and 38 functions for motorbike; these functions are responsible for the majority of the overhead.

For the CaPI variants, we used the selection specification shown in Listing 1, which selects all call paths performing MPI communication. Additionally, we filtered out functions defined in files from a directory that contains mostly code related to I/O operations, as well as functions specified as inline.

We manually validated these ICs by comparing the resulting profiles with the results from scorep-full. Both profiles represented the behavior of the program accurately and preserved the call paths comprising hot spots.

Figure 5 shows the execution time measured for each variant. For both benchmarks, vanilla-gcc performed slightly better than vanilla-clang. For cavity, however, this difference is minuscule.

Compared to vanilla-gcc, the unfiltered instrumentation scorep-full produced only 8% overhead for cavity, but 135% for motorbike. Using the profile-guided filter variant scorep-filt reduced the overhead significantly, to 3% for cavity and 44% for motorbike. The capi-gnu variant, however, was slower than scorep-filt in both cases. This is in part due to the initial look-up and registration of the shared library symbols. This step is quite time-consuming because the CaPI-generated IC consists of an include list of about 110k entries, which have to be cross-checked against the discovered symbols. In the capi-scorep variant, the performance penalty due to the initialization overhead is eliminated, thus showing better results in both cases. The remaining discrepancy in execution time between capi-gnu and capi-scorep is likely due to the differences in compilers and the Score-P measurement API.

The hybrid variant showed the most promising results. For cavity, it reduces the instrumentation overhead to below 1%. Similarly, hybrid yielded the overall best results for motorbike with an overhead of 30% compared to vanilla-gcc.

Fig. 5. Mean execution time of the instrumentation variants for the cavity and motorbike benchmarks over 5 runs. The total time is split into contributions from initialization and the subsequent execution. The error bars indicate the standard deviation. Note that the lower limits of the y-axes have been adjusted for better visibility.

6 Usability and Validation Impediments

In this section, we highlight some of the usability impediments that we had to overcome in the instrumentation of OpenFOAM.

As mentioned earlier, dealing with the particularities of uncommon build systems can be cumbersome and tedious. As such, OpenFOAM's wmake made certain aspects of the tool application more difficult. We do not consider it a separate issue in this list. Nonetheless, it should be noted that the chosen build system heavily influences the ease-of-use of any instrumentation workflow.

Whole-Program CG. The generation of the whole-program CG is the most time-consuming part of our toolchain, and took several hours for OpenFOAM. The main difficulty, however, lies in setting up the analysis correctly. It has to be executed as a preprocessing step and is therefore not easily applied via the build system. This makes it difficult to identify which source files should be included.

For the initial local CG analysis, it is sufficient to search the code base for C++ files. The subsequent merging into a whole-program CG, however, requires additional care. OpenFOAM builds a large number of individual solver executables. Merging them all together is not sensible, as their behavior varies significantly. Hence, to generate the CG for each solver, we first merge all local CGs of the OpenFOAM libraries into a large library CG. We then identify the source files specific to the solver and merge the corresponding CGs with the library CG.

In general, this requires the user to have detailed knowledge about the build process of the target application. In its current form, the setup of the CG analysis therefore constitutes a significant barrier.

Limitations of Static Analysis. Due to the inherent limitations of static analysis, some call paths cannot be correctly identified by MetaCG. The resulting CG is therefore not guaranteed to be complete. A common reason for missed call edges is the use of function pointers [17]. For OpenFOAM, this played a minor role. In general, however, we cannot guarantee that there are no other issues that lead to missed calls, e.g., due to bugs in the analysis or misconfigured selection specifications. Unfortunately, there is no direct way to reliably check that a recorded profile is complete. Hence, it is the responsibility of the user to manually verify that no major parts of the code are missing.

To mitigate the issue, MetaCG provides a tool that compares the statically constructed CG with one constructed from a full-instrumentation profile and adds missing edges. This approach, however, introduces additional steps into the instrumentation workflow and requires a fully-instrumented build of the target. Furthermore, the resulting CG is only valid for the specific program inputs used to generate the profile. In order to guarantee completeness, this validation step must be repeated every time the program calling behavior changes based on inputs. For large code bases, this is impractical.

Managing Multiple Configurations. In the use case of OpenFOAM, it is sensible to create separate ICs for different solvers, as they may use completely different parts of the main library. As the instrumentation of the selected functions happens at compile-time, every new IC requires a rebuild of the program. Moreover, for multiple, different ICs, a separate build folder per IC is required.

This is especially tedious in OpenFOAM because the build system is designed to have only one build for each compiler configuration. Maintaining multiple instrumented builds is doable, but requires tedious configuration work. In addition, the user needs to keep track of the purpose of each build and document the configuration steps. If this is done poorly, the wrong build may be used, potentially leading to incomplete profiling data.

Furthermore, having multiple builds of a large program can waste significant amounts of disk space, despite the binaries being virtually identical.

In order to avoid these issues altogether, Score-P provides an option for run-time filtering. Using this method, all functions are initially instrumented. At run time, the entry/exit hooks are still called, but measurements are only recorded for functions that pass the filter. As a result, the overhead is generally larger than with compile-time filtering, which may lead to skewed measurements. This is especially apparent with our toolchain, which generates a filter list containing ≈29k entries for the cavity case. We observed a significant increase in overhead using run-time filtering with this CaPI-generated filter, compared to the compile-time filtering method.
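
For reference, enabling Score-P's run-time filtering only requires pointing the measurement system to the filter file via an environment variable; the file name below is a placeholder:

    export SCOREP_FILTERING_FILE=capi_filter.scorep
    mpirun -np 4 ./icoFoam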

7 Discussion

We have demonstrated that our tool is capable of generating instrumentation configurations for large-scale codes. The results show that a hybrid approach, which combines the tailored CaPI selection with run-time filtering to remove remaining high-overhead functions, proved to be especially effective in mitigating the overhead, while preserving relevant call paths. This demonstrates that the compiler-assisted instrumentation workflow is in principle feasible to apply and beneficial w.r.t. overhead reduction.

In practice, however, the application to OpenFOAM proved to be quite time-consuming and required a good understanding of the code base and build system. We therefore conclude that, for most cases, the use of existing profile-guided filtering techniques with manual adjustments is preferable, as they require far less configuration overhead. The issues we identified apply in large part to other compiler-assisted instrumentation tools that rely on prior static analysis. This relates to PIRA in particular, which uses the same CG analysis workflow. In order for compiler-assisted instrumentation tools to be a viable alternative, the following key challenges must be addressed:

Simplification of the Analysis Workflow: The global static CG analysis is a requirement for the presented selection techniques. Currently, this step is very time-consuming. In order to simplify the workflow, the manual setup must be reduced by providing better integration into the compilation process.

Management of Build Configurations: Different instrumented versions of a code currently require maintaining multiple program builds. Instrumentation tools should aid in organizing and identifying them. Ideally, the need for separate builds should be eliminated altogether by providing an alternative run-time adaptation method that introduces little overhead.

Detection of Missed Calls: Currently, the user is unable to tell if function calls are missing due to limitations in the static analysis. A manual comparison with a complete instrumentation of the same program is possible, but requires extra steps that have to be repeated for every input configuration. Ideally, the static analysis phase should detect situations where such problems might occur and insert run-time checks to detect missed calls.

8 Conclusion and Future Work

Fig. 6. Envisioned workflow with embedded CG: The CG analysis is performed as link-time optimization (LTO) on all object files of a shared library or executable and the CG is embedded into it. At run time, the CaPI runtime library queries the objects for their respective CGs and merges them to construct the whole-program CG.

We presented the Compiler-assisted Performance Instrumentation tool for user-defined selective program instrumentation. CaPI was demonstrated by creating tailored instrumentation for the CFD framework OpenFOAM. Our evaluation showed that a hybrid selection approach, comprised of static selection and run-time filtering, is effective in eliminating overhead. However, the amount of required manual work for CaPI is undesirable. Hence, we identified key areas for improvement to make such techniques more accessible.

Currently, the biggest usability issue for CaPI and similar tools is the requirement for a separate analysis phase. This issue could be mitigated by shifting the whole-program CG construction to link-time and embedding the CG into the generated binary, as illustrated in Fig. 6. In this proposed toolchain, a suitable dynamic instrumentation method enables the selection and instrumentation steps at program start. This opens up opportunities for dynamic instrumentation refinement based on collected run-time information, as employed by PIRA, without the need to rebuild the program. In addition, the availability of the CG at run-time would enable the assessment of the IC’s completeness. Further work is required to assess the feasibility of this approach.

CaPI is available at https://github.com/tudasc/CaPI under the BSD 3-Clause license.