4. RAJA Performance Suite

The RAJA Performance Suite source code is near-final. It will be released soon, along with benchmark baseline data and instructions for running the benchmark and generating evaluation metrics.

The RAJA Performance Suite contains a variety of numerical kernels that represent important computational patterns found in HPC applications. It is a companion project to RAJA, which is a library of software abstractions used by developers of C++ applications to write portable, single-source code. The RAJA Performance Suite enables performance experiments and comparisons for kernel variants that use RAJA and those that do not.

Source code and documentation for RAJA and the RAJA Performance Suite are available at:

  • https://github.com/LLNL/RAJA

  • https://github.com/LLNL/RAJAPerf

Important

The RAJA Performance Suite benchmark is limited to a subset of the kernels in the full Suite, as described in Problems.

4.1. Purpose

The main purpose of the RAJA Performance Suite is to analyze the performance of loop-based computational kernels that are representative of those found in HPC applications and are implemented using RAJA. The kernels in the Suite originate from sources ranging from open-source HPC benchmarks to restricted-access production applications. Kernels exercise various loop structures as well as parallel operations such as reductions, atomics, scans, and sorts.

Each kernel in the Suite appears in RAJA and non-RAJA variants that exercise common programming models, such as OpenMP, CUDA, HIP, and SYCL. Performance comparisons between RAJA and non-RAJA variants help improve the RAJA implementation and identify the impact C++ abstractions have on a compiler's ability to optimize. Kernels in the Suite often serve as collaboration tools that enable the RAJA team to work with vendors to resolve performance issues observed in production applications that use RAJA.
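
To illustrate the relationship between variants, the following is a minimal sketch of a Base and RAJA variant pair for a hypothetical DAXPY-style loop. The function and array names are illustrative; consult the Suite source for the actual kernel implementations.

#include <RAJA/RAJA.hpp>

// Base variant: the loop written directly in the back-end programming
// model (here, sequential C++).
void daxpy_base(double* y, const double* x, double a, int N)
{
  for (int i = 0; i < N; ++i) {
    y[i] += a * x[i];
  }
}

// RAJA variant: the same computation expressed with RAJA::forall.
// Changing the execution policy (e.g., to an OpenMP, CUDA, or HIP
// policy) retargets the loop without changing its body.
void daxpy_raja(double* y, const double* x, double a, int N)
{
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=] (RAJA::Index_type i) {
      y[i] += a * x[i];
    });
}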

To align execution of kernels in the Suite more closely with how they would run in a full application, benchmark runs must use multiple MPI ranks; this ensures that all resources on a compute node are exercised and avoids misrepresenting kernel and node performance. RAJA is a potential X in the commonly cited MPI + X parallel application paradigm, where MPI provides coarse-grained, distributed-memory parallelism and X (e.g., RAJA) supports fine-grained parallelism within an MPI rank. The RAJA Performance Suite can be configured with MPI so that execution of kernels in the Suite represents how those kernels would be exercised in an MPI + X HPC application. When the Suite is run with multiple MPI ranks, the same kernel code is executed on each rank; synchronization and communication across ranks involve only sending execution timing information to rank zero for reporting purposes.
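
As a minimal sketch of this MPI + X pattern, the following hypothetical program runs the same RAJA kernel on every rank and communicates only timing information to rank zero. It is illustrative only and is not the Suite's actual driver code.

#include <mpi.h>
#include <RAJA/RAJA.hpp>
#include <vector>
#include <cstdio>

// Illustrative MPI + X pattern: every rank executes the same kernel on
// its own data; the only inter-rank communication is gathering timing
// information to rank 0 for reporting.
int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const RAJA::Index_type N = 1000000;
  std::vector<double> a(N, 1.0), b(N, 2.0);
  double* pa = a.data();
  const double* pb = b.data();

  double t0 = MPI_Wtime();
  // X = fine-grained parallelism within the rank (RAJA here).
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=] (RAJA::Index_type i) { pa[i] += pb[i]; });
  double t = MPI_Wtime() - t0;

  // Report the slowest rank's kernel time on rank 0.
  double tmax;
  MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0) { std::printf("max kernel time: %g s\n", tmax); }

  MPI_Finalize();
  return 0;
}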

Important

For RAJA Performance Suite benchmark execution, the Suite must be run with MPI to ensure that all resources on a compute node are exercised and to avoid misrepresenting kernel and node performance. This is described in the instructions provided in Running.

4.2. Characteristics

The RAJA Performance Suite GitHub project contains the code for all the Suite kernels and all essential external software dependencies in Git submodules. Thus, dependency versions are pinned to each version of the Suite. Building the Suite requires an installation of CMake for configuring a build, a C++17 compliant compiler to build the code, and an MPI library installation when MPI is to be used.

The Suite can be run in many ways by specifying parameters and options as command-line arguments. The intent is that one can build the code once and use scripts to execute multiple Suite runs to generate data for a desired performance experiment.

In particular, kernel variants, problem sizes, and other parameters can be set by the user from the command line. Specific instructions for running the RAJA Performance Suite benchmark are provided in Running.

4.2.1. Problems

The RAJA Performance Suite benchmark is limited to a subset of kernels in the full Suite to focus on some of the more important computational patterns found in LLNL applications. That subset is described below, grouped by priority.

Note

Each kernel contains a complete reference description located in the header file for the kernel object <kernel-name>.hpp. The reference is a C-style sequential implementation of the kernel in a comment section near the top of the header file.
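
As a hypothetical illustration of the format (the actual reference for each kernel is in its own header), such a comment block might look like:

// Reference implementation (C-style sequential version of the kernel),
// as found near the top of a <kernel-name>.hpp header. Names here are
// illustrative only.
//
// for (Index_type i = ibegin; i < iend; ++i) {
//   y[i] += a * x[i];
// }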

4.2.1.1. Priority 1 kernels

  • Apps group (directory src/apps)

    1. DIFFUSION3DPA element-wise action of a 3D finite element volume diffusion operator via partial assembly and sum factorization (nested loops, GPU shared memory, RAJA::launch API)

    2. EDGE3D stiffness matrix assembly for a 3D MHD calculation (single loop with included function call, RAJA::forall API)

    3. ENERGY internal energy calculation from an explicit hydrodynamics algorithm (multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)

    4. FEMSWEEP finite element implementation of linear sweep algorithm used in radiation transport (nested loops, RAJA::launch API)

    5. INTSC_HEXRECT intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations (single loop, RAJA::forall API)

    6. MASS3DEA element assembly of a 3D finite element mass matrix (nested loops, GPU shared memory, RAJA::launch API)

    7. MASS3DPA_ATOMIC action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly and sum factorization (nested loops, GPU shared memory, RAJA::launch API)

    8. MASSVEC3DPA element-wise action of a 3D finite element mass matrix via partial assembly and sum factorization on a block vector (nested loops, GPU shared memory, RAJA::launch API)

    9. MATVEC_3D_STENCIL matrix-vector product based on a 3D mesh stencil (single loop, data access via indirection array, RAJA::forall API)

    10. NODAL_ACCUMULATION_3D on a 3D structured hexahedral mesh, sum a contribution from each hex vertex (nodal value) to its centroid (zonal value) (single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)

    11. VOL3D on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) (single loop, data access via indirection array, RAJA::forall API)
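
Many of the kernels above use the RAJA::launch API for nested loops. The following is a minimal sketch of that API for a hypothetical element/quadrature-point loop nest, assuming a recent RAJA version in which RAJA::launch is no longer in the experimental namespace; policies and names are illustrative, and the Suite kernels add GPU shared memory and other details not shown here.

#include <RAJA/RAJA.hpp>

// Illustrative RAJA::launch nested-loop sketch (hypothetical kernel, not
// taken from the Suite). Sequential policies are used for brevity; GPU
// variants substitute team/thread policies in the launch and loop types.
using launch_pol = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
using outer_pol  = RAJA::LoopPolicy<RAJA::seq_exec>;
using inner_pol  = RAJA::LoopPolicy<RAJA::seq_exec>;

void launch_sketch(double* out, const double* in, int n_elems, int n_qpts)
{
  RAJA::launch<launch_pol>(
    RAJA::LaunchParams(RAJA::Teams(n_elems), RAJA::Threads(n_qpts)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
      // Outer loop over elements (maps to GPU teams/blocks).
      RAJA::loop<outer_pol>(ctx, RAJA::RangeSegment(0, n_elems), [&](int e) {
        // Inner loop over quadrature points (maps to GPU threads).
        RAJA::loop<inner_pol>(ctx, RAJA::RangeSegment(0, n_qpts), [&](int q) {
          out[e * n_qpts + q] = 2.0 * in[e * n_qpts + q];
        });
      });
    });
}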

4.2.1.2. Priority 2 kernels

  • Apps group (directory src/apps)

    1. CONVECTION3DPA element-wise action of a 3D finite element volume convection operator via partial assembly and sum factorization (nested loops, GPU shared memory, RAJA::launch API)

    2. DEL_DOT_VEC_2D divergence of a vector field at a set of points on a mesh (single loop, data access via indirection array, RAJA::forall API)

    3. INTSC_HEXHEX intersection between two 24-sided hexahedra, including volume and moment calculations (multiple single-loop operations in sequence, RAJA::forall API)

    4. LTIMES one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product (nested loops, RAJA::kernel API)

    5. MASS3DPA element-wise action of a 3D finite element mass matrix via partial assembly and sum factorization (nested loops, GPU shared memory, RAJA::launch API)

  • Basic group (directory src/basic)

    1. MULTI_REDUCE multiple reductions in a kernel, where number of reductions is set at run time (single loop, irregular atomic contention, RAJA::forall API)

    2. REDUCE_STRUCT multiple reductions in a kernel, where number of reductions (6) is known at compile time (single loop, multiple reductions, RAJA::forall API)

    3. INDEXLIST_3LOOP construction of a set of indices used in other kernel executions (single loops, vendor scan implementations, RAJA::forall API)

  • Comm group (directory src/comm)

    1. HALO_PACKING_FUSED packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes (overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)
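
As an illustration of the reduction patterns in the Basic group above, the following is a minimal RAJA sketch that carries several reductions through one loop. Names and policies are illustrative; it is not the Suite's REDUCE_STRUCT or MULTI_REDUCE code.

#include <RAJA/RAJA.hpp>

// Illustrative multi-reduction sketch: a sum, min, and max computed in a
// single loop via RAJA reduction objects. A parallel policy pair (e.g.,
// OpenMP or CUDA exec/reduce policies) would replace the sequential ones.
void reduce_sketch(const double* x, RAJA::Index_type N,
                   double& sum_out, double& min_out, double& max_out)
{
  RAJA::ReduceSum<RAJA::seq_reduce, double> sum(0.0);
  RAJA::ReduceMin<RAJA::seq_reduce, double> mn(1e300);
  RAJA::ReduceMax<RAJA::seq_reduce, double> mx(-1e300);

  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    [=] (RAJA::Index_type i) {
      sum += x[i];
      mn.min(x[i]);
      mx.max(x[i]);
    });

  sum_out = sum.get();
  min_out = mn.get();
  max_out = mx.get();
}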

4.2.2. Figure of Merit

There are two figures of merit (FOM) for each benchmark kernel: execution time and memory bandwidth. Details of the FOM definitions, how to set problem size for a given architecture, and how key output quantities are computed will be provided with the benchmark release.

4.3. Source code modifications

Please see Run Rules Synopsis for general guidance on allowed modifications. For the RAJA Performance Suite, we define the following restrictions on source code modifications:

  • While source code changes to the RAJA Performance Suite kernels and to RAJA can be proposed, RAJA may not be removed from RAJA kernel variants in the Suite or replaced with any other library. The Base kernel variants in the Suite are provided to show how each kernel could be implemented directly in the corresponding programming model back-end without the RAJA abstraction layer. Apart from some special cases, the RAJA and Base variants for each kernel should perform the same computation.

4.4. Building

The RAJA Performance Suite uses a CMake-based system to configure the code for compilation. As noted earlier, all non-system related software dependencies are included in the RAJA Performance Suite repository as Git submodules.

The current RAJA Performance Suite benchmark uses the v2025.12.0 version of the code. When the git repository is cloned, you will be on the develop branch, which is the default RAJA Performance Suite branch. To get a local copy of this version of the code and the correct versions of submodules:

$ git clone --recursive https://github.com/LLNL/RAJAPerf.git
$ cd RAJAPerf
$ git checkout v2025.12.0
$ git submodule update --init --recursive

When building the RAJA Performance Suite, RAJA and the RAJA Performance Suite are built together using the same CMake configuration. The basic process for specifying a configuration and generating a build space is to create a build directory and run CMake in it with the proper options. For example:

$ pwd
path/to/RAJAPerf
$ mkdir my-build
$ cd my-build
$ cmake <cmake args> ..
$ make -j (or make -j <N> to build with a specified number of cores)

For convenience and informational purposes, configuration scripts are maintained in the RAJAPerf/scripts subdirectories for various build configurations. For example, the RAJAPerf/scripts/lc-builds directory contains scripts that can be used to generate build configurations for machines in the Livermore Computing (LC) Center at Lawrence Livermore National Laboratory. These scripts should be run from the top-level RAJAPerf directory. Each script creates a descriptively named build space directory and runs CMake with a configuration appropriate for the platform and compiler(s) indicated by the script name. For example, to build the code to generate baseline data on the El Capitan system:

$ pwd
path/to/RAJAPerf
$ ./scripts/lc-builds/toss4_cray-mpich_amdclang.sh 9.0.1 6.4.3 gfx942
$ cd build_lc_toss4-cray-mpich-9.0.1-amdclang-6.4.3-gfx942
$ make -j

This builds the code for CPU-GPU execution using the system-installed version 9.0.1 of the Cray MPICH MPI library with version 6.4.3 of the AMD clang compiler (ROCm 6.4.3), targeting GPU compute architecture gfx942, which is appropriate for the AMD MI300A APU hardware on El Capitan. Please consult the build scripts in the RAJAPerf/scripts/lc-builds directory for hints on building the code for other architectures and compilers. Additional information on build configurations is provided in the RAJA Performance Suite User Guide for the version of the code you are interested in.

4.5. Running

After the code is built, the executable will be located in the bin directory of the build space. Continuing the El Capitan example above:

$ pwd
path/to/build_lc_toss4-cray-mpich-9.0.1-amdclang-6.4.3-gfx942
$ ls bin
rajaperf.exe

To get usage information:

$ path/to/rajaperf.exe --help (or -h)

This command will print all available command-line options along with their potential arguments and defaults. Options are available to print information about the Suite, select output directory and file details, select kernels and variants to run, and control how they are run (problem sizes, number of times each kernel is run, data spaces used for array allocation, etc.). All arguments are optional. If no arguments are specified, the Suite will run all kernels in their default configurations for the variants available in the build configuration.

The script used to run the benchmark and generate baselines for El Capitan is described in Example Benchmark Results. A similar recipe should be followed when benchmarking other systems.

4.6. Validation

Each kernel and variant run generates a checksum value based on kernel execution output, such as an output data array computed by the kernel. The checksum depends on the problem size run for the kernel; thus, each checksum is computed at run time. Validation criteria are defined in terms of the checksum difference between each kernel variant run at a given problem size and a corresponding reference variant. The Base_Seq variant defines the reference checksum, so that variant should be run for each kernel as part of a performance study. Each kernel is annotated in the source code as to whether the checksum for each variant is expected to match the reference checksum exactly or to be within some tolerance due to order-of-operation differences when run in parallel.

Whether the checksum for each kernel is considered to be within its expected tolerance is reported as checksum PASSED or FAILED in the output files.
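
A minimal sketch of this kind of checksum comparison follows. It is illustrative only; the Suite's actual checksum scheme and tolerances are defined in its source, and the weighting and tolerance shown here are assumptions.

#include <cmath>
#include <cstdio>

// Illustrative checksum validation sketch: reduce an output array to a
// scalar checksum, then compare a variant's checksum against the Base_Seq
// reference within a tolerance that allows for parallel order-of-operation
// differences.
double checksum(const double* data, long N)
{
  double sum = 0.0;
  for (long i = 0; i < N; ++i) {
    sum += (i + 1) * data[i];  // index weighting distinguishes permuted outputs
  }
  return sum;
}

bool validate(double variant_cksum, double ref_cksum, double rel_tol)
{
  double diff = std::fabs(variant_cksum - ref_cksum);
  bool passed = diff <= rel_tol * std::fabs(ref_cksum);
  std::printf("checksum %s (diff = %g)\n", passed ? "PASSED" : "FAILED", diff);
  return passed;
}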

An example of this checksum reporting for the El Capitan baseline runs will be provided with the baseline data.

Note

More accurate Base_Seq summation tunings are planned, since a left-fold summation loses accuracy at large problem sizes.

4.7. Example Benchmark Results

Tables of El Capitan baseline results will be included here when the benchmark baseline data is released.

4.8. Memory Usage

Benchmark memory usage follows from the problem size used for each kernel, which is described in Example Benchmark Results above.

4.9. Strong Scaling on El Capitan

The RAJA Performance Suite is primarily a single-node and compiler assessment tool. Thus, strong scaling is not part of the benchmark.

4.10. Weak Scaling on El Capitan

The RAJA Performance Suite is primarily a single-node and compiler assessment tool. Thus, weak scaling is not part of the benchmark.

4.11. References