Tutorial: Local Containers
This tutorial will walk you through getting Benchpark running in your own local container. There are some known issues and bugs, but one advantage of working in a local container is that you can test basic benchmarks, then process and analyze their output, without having to deal with running on a real cluster.
Step 1: Getting Started with Container Runtimes
If you haven’t used containers on your local system before, allow us to recommend the Podman Desktop container runtime suite. It can be installed using the instructions from their official site or with Homebrew. It is free and quite robust. Its CLI is more or less the same as Docker’s, so if you are following along with Docker, the commands here should work equally well, but let us know if that is not the case.
Some Changes to the Default VM
For the purposes of this tutorial, I recommend setting at least 8 cores and about 16 GB of memory as follows (note that podman machine set takes the memory value in MiB):
podman machine stop
podman machine set --cpus=8 --memory 15258
podman machine start
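To double-check that the new settings took effect, you can inspect the machine. The Resources template fields below match current Podman releases, but the names may vary between versions:
podman machine inspect --format '{{.Resources.CPUs}} CPUs, {{.Resources.Memory}} MiB'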
Some Common Issues
If you are behind a firewall, you may need to tell your VM about the firewall’s self-signed certificate to avoid errors when pulling containers. Here is an example for macOS:
security find-certificate -a -p /Library/Keychains/System.keychain | \
podman machine ssh sudo tee /etc/pki/ca-trust/source/anchors/macos-system-certs.pem > /dev/null
and then you will need to restart the machine as above.
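For completeness, the full restart sequence looks like this (the update-ca-trust step assumes your VM uses the standard Fedora/EL trust store, which the default Podman machine image does):
podman machine ssh sudo update-ca-trust
podman machine stop
podman machine start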
Step 2: Playing with the Benchpark Container
Now, let’s pull and run a Benchpark Container:
podman pull ghcr.io/llnl/benchpark/benchpark-flux-el10
podman run -it benchpark-flux-el10
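If your container runtime does not resolve the short image name, use the fully qualified one:
podman run -it ghcr.io/llnl/benchpark/benchpark-flux-el10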
And now, if everything worked properly, you should be dropped into a shell that is already running Flux and Benchpark. The Flux broker has been tricked into thinking the container has 4 nodes with 8 cores each. mpibind has also been installed as a Flux plugin, and is managing “affinity” on the various “nodes.” A simple MPICH installation is also available.
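You can verify the fake topology for yourself with a standard Flux command; it should report 4 “nodes” with 8 cores each:
flux resource list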
Some simple Benchpark commands
Let’s run some simple Benchpark commands to poke around:
$ benchpark list systems
Systems - SYSTEM_DEFINITION CLUSTER/INSTANCE
aws-pcluster instance_type=[c6g.xlarge|c4.xlarge|hpc7a.48xlarge|hpc6a.48xlarge]
aws-tutorial instance_type=[c7i.48xlarge|c7i.metal-48xl|c7i.24xlarge|c7i.metal-24xl|c7i.12xlarge]
csc-lumi
cscs-daint
cscs-eiger
fluxtainer
generic-x86
jsc-juwels
lanl-rocinante cluster=[rocinante|tycho|crossroads]
lanl-venado cluster=[grace-hopper|grace-grace]
lbnl-perlmutter
llnl-cluster cluster=[magma|dane|rzgenie|ruby-deprecated|poodle-deprecated]
llnl-elcapitan cluster=[tioga|elcapitan|tuolumne]
llnl-matrix
llnl-sierra-deprecated
riken-fugaku
snl-eldorado
This will show you all the systems available. The one this container uses is called fluxtainer. Let’s get some more info on it:
$ benchpark info system fluxtainer
Hardware:
arm:
cpu_arch: arm64
x86:
cpu_arch: x86_64_v3
Maintainers:
nhanford
Variants:
name: timeout
default: 120
description: Set job timeout limit (in minutes). Has to be under the limit for selected 'queue'.
values: None
validator: <function variant.<locals>._always_true at 0x7faff31f1620>
multi: False
sticky: False
name: instance_type
default: x86
description: Target Architecture
values: ('arm', 'x86')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7faff125ae80>
multi: False
sticky: False
In the output, we can see that there is a variant called instance_type. Let’s
keep that in mind as we initialize the system. In my case, the architecture is arm,
but yours might be x86.
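If you are not sure which one applies to you, uname will tell you (aarch64 or arm64 means arm here; x86_64 means x86):
uname -m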
benchpark system init --dest=my-fluxtainer fluxtainer instance_type=arm
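If you are curious, you can peek at the system configuration this generated in the destination directory (the exact file layout may differ between Benchpark versions):
ls -R my-fluxtainer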
Okay now let’s look at some benchmarks:
$ benchpark list experiments
Experiments - BENCHMARK+PROGRAMMING_MODEL+SCALING
ad+[mpi]
amg2023+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
babelstream+[openmp|cuda|rocm]
branson+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
commbench+[cuda|rocm]
genesis+[openmp|mpi]
gpcnet+[mpi]
gromacs+[openmp|cuda|rocm|mpi]
hpcg+[openmp|mpi]+[strong|weak]
hpl+[openmp|mpi]+[strong|weak]
ior+[mpi]+[strong|weak]
kripke+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
laghos+[cuda|rocm|mpi]+[strong|weak|throughput]
lammps+[openmp|cuda|rocm|mpi]+[strong]
md-test+[mpi]+[strong]
osu-micro-benchmarks+[cuda|rocm|mpi]
phloem+[mpi]
py-scaffold+[cuda|rocm]+[strong|weak]
quicksilver+[openmp|mpi]+[strong|weak]
qws+[openmp|mpi]+[strong|weak|throughput]
raja-perf+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
remhos+[cuda|rocm|mpi]+[strong|weak|throughput]
salmon-tddft+[openmp|mpi]
smb+[mpi]
sparta-snl+[openmp|cuda|rocm|mpi]
stream+[mpi]
test
Let’s look at the osu-micro-benchmarks:
$ benchpark info experiment osu-micro-benchmarks
Maintainers:
nhanford
Ramble Name:
osu_micro_benchmarks
Spack Name:
osu_micro_benchmarks
Variants:
name: exec_mode
default: test
description: Execution mode
values: ('test', 'perf')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90cc0>
multi: False
sticky: False
name: affinity
default: none
description: Build and run the affinity package
values: ('none', 'on')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c909a0>
multi: False
sticky: False
name: hwloc
default: none
description: Get underlying infrastructure topology
values: ('none', 'on')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90a40>
multi: False
sticky: False
name: package_manager
default: spack
description: package manager to use
values: ('spack', 'environment-modules', 'user-managed', 'pip', 'spack-pip')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c907c0>
multi: False
sticky: False
name: append_path
default:
description: Append to environment PATH during experiment execution
values: None
validator: <function variant.<locals>._always_true at 0x7f4083c90b80>
multi: False
sticky: False
name: prepend_path
default:
description: Prepend to environment PATH during experiment execution
values: None
validator: <function variant.<locals>._always_true at 0x7f4083c90d60>
multi: False
sticky: False
name: n_repeats
default: 0
description: Number of experiment repetitions
values: None
validator: <function variant.<locals>._always_true at 0x7f4083c90f40>
multi: False
sticky: False
name: allocation
default: standard
description: Allocation modifier mode
values: ('standard', 'torchrun-hpc')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90540>
multi: False
sticky: False
name: mpi
default: True
description: Run with MPI
values: (True, False)
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90220>
multi: False
sticky: False
name: rocm
default: False
description: Build and run with ROCm
values: (True, False)
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c934c0>
multi: False
sticky: False
name: cuda
default: False
description: Build and run with CUDA
values: (True, False)
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c93420>
multi: False
sticky: False
name: workload
default: osu_latency
description: workloads available
values: ('osu_bibw', 'osu_bw', 'osu_latency', 'osu_latency_mp', 'osu_latency_mt', 'osu_mbw_mr', 'osu_multi_lat', 'osu_allgather', 'osu_allreduce_persistent', 'osu_alltoallw', 'osu_bcast_persistent', 'osu_iallgather', 'osu_ialltoallw', 'osu_ineighbor_allgather', 'osu_ireduce', 'osu_neighbor_allgatherv', 'osu_reduce_persistent', 'osu_scatterv', 'osu_allgather_persistent', 'osu_alltoall', 'osu_alltoallw_persistent', 'osu_gather', 'osu_iallgatherv', 'osu_ibarrier', 'osu_ineighbor_allgatherv', 'osu_ireduce_scatter', 'osu_neighbor_alltoall', 'osu_reduce_scatter', 'osu_scatterv_persistent', 'osu_allgatherv', 'osu_alltoall_persistent', 'osu_barrier', 'osu_gather_persistent', 'osu_iallreduce', 'osu_ibcast', 'osu_ineighbor_alltoall', 'osu_iscatter', 'osu_neighbor_alltoallv', 'osu_reduce_scatter_persistent', 'osu_allgatherv_persistent', 'osu_alltoallv', 'osu_barrier_persistent', 'osu_gatherv', 'osu_ialltoall', 'osu_igather', 'osu_ineighbor_alltoallv', 'osu_iscatterv', 'osu_neighbor_alltoallw', 'osu_scatter', 'osu_allreduce', 'osu_alltoallv_persistent', 'osu_bcast', 'osu_gatherv_persistent', 'osu_ialltoallv', 'osu_igatherv', 'osu_ineighbor_alltoallw', 'osu_neighbor_allgather', 'osu_reduce', 'osu_scatter_persistent', 'osu_acc_latency', 'osu_cas_latency', 'osu_fop_latency', 'osu_get_acc_latency', 'osu_get_bw', 'osu_get_latency', 'osu_put_bibw', 'osu_put_bw', 'osu_put_latency', 'osu_hello', 'osu_init')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c93600>
multi: True
sticky: False
name: version
default: 7.5
description: app version
values: ('latest', '7.5')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c93740>
multi: False
sticky: False
Okay that was a lot to take in. Most of it is variants yet again. We’ll discuss a
couple in detail: affinity will run the affinity test to show where different communicating
processes ended up within a node, be they MPI ranks or OpenMP threads, on CPU cores or
GPUs. This can be very helpful for debugging common parallel performance issues.
The other variant to look at is workload. This variant determines which of the actual micro-benchmarks get run. Okay, let’s get started with a simple example, but first some caveats:
Caveats and known issues
1. The OSU benchmarks are a work in progress, and we are still working out the details on scaling ranks at this stage, so we will stick with 2 ranks on 2 fake nodes.
2. Collectives can hang in this configuration as this multi-node trick we’re playing on the Flux broker is really more for testing broker throughput, scheduling algorithms, etc., not actual applications. We will demonstrate a more robust single-node configuration a bit later.
Okay now that we have that out of the way, let’s initialize, set up, and run an experiment:
benchpark experiment init my-fluxtainer osu-micro-benchmarks workload=osu_allreduce,osu_mbw_mr affinity=on
Now we get a message back telling us what to run next:
Run `benchpark setup my-fluxtainer/osu-micro-benchmarks <experiments_root>` to generate Ramble workspace
Let’s call the experiments_root wkp for now…
benchpark setup my-fluxtainer/osu-micro-benchmarks wkp
If you get the error
fatal: hardlink different from source at ...
Run:
rm -rf ~/.benchpark
benchpark bootstrap
and reattempt the above benchpark setup ... command. And yet again, Benchpark tells
us what to run next:
Clearing existing workspace /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks
Setting up configs for Ramble workspace /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/configs
Cloning packages to /home/fluxuser/benchpark/wkp/spack-packages
Cloning Spack to /home/fluxuser/benchpark/wkp/spack
Cloning Ramble to /home/fluxuser/benchpark/wkp/ramble
To complete the benchpark setup, do the following:
. /home/fluxuser/benchpark/wkp/setup.sh
Further steps are needed to build the experiments (ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace workspace setup) and run them (ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace on)
So let’s do that:
. /home/fluxuser/benchpark/wkp/setup.sh
ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace workspace setup
ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace on
And Ramble tells us it built the software:
==> Streaming details to log:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34.out
==> Setting up 2 out of 2 experiments:
==> Experiment #1 (1/2):
==> name: osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2
==> root experiment_index: 1
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34/osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2.out
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34.out
==> Experiment #2 (2/2):
==> name: osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2
==> root experiment_index: 2
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34/osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2.out
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34.out
And Ramble and Flux tell us they ran the jobs:
==> Streaming details to log:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/execute.2026-05-06_23.43.29.out
==> Executing 2 out of 2 experiments:
==> Log files for experiments are stored in: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/execute.2026-05-06_23.43.29
==> Running executors...
ƒJMvePMyZ
ƒJMxjxM5q
So what just happened?
First, Benchpark copied Spack, its packages repository (with all the build recipes), and Ramble into a dedicated workspace for this experiment. This isolates each experiment from the others and makes it reproducible.
Second, Ramble built our benchmarks according to the instructions given at initialization, and also set up the batch scripts for the job scheduler (in this case, Flux).
Finally, Ramble executed the benchmarks using the Flux scheduler.
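By the way, the two strings printed at the end of the run (ƒJMvePMyZ and ƒJMxjxM5q) are Flux job IDs, and you can inspect them with standard Flux commands, for example:
flux jobs -a    # list all jobs, including completed ones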
Now let’s analyze the results:
ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace workspace analyze -f json
And Ramble tells us it performed the analysis:
==> Streaming details to log:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25.out
==> Analyzing 2 out of 2 experiments:
==> Experiment #1 (1/2):
==> name: osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2
==> root experiment_index: 1
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25/osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2.out
==> Invalidating experiment results cache: timestamp difference
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25.out
==> Experiment #2 (2/2):
==> name: osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2
==> root experiment_index: 2
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25/osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2.out
==> Invalidating experiment results cache: timestamp difference
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25.out
==> Results are written to:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.2026-05-06_23.46.26.json
==> Symlinks updated:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.latest.json
So let’s look at one:
cat /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.latest.json | jq
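If you just want a quick summary rather than the full dump, a jq filter along these lines works (the field names are assumptions based on Ramble’s results format, so adjust them to match what you actually see in the file):
jq '.experiments[] | {name, RAMBLE_STATUS}' /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.latest.json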
If you scroll around, you’ll see that Ramble captured quite a bit of information about our benchmarks, most importantly, the performance data.
Hey, remember we set affinity=on? Where did that end up? Let’s poke around this workspace and see:
cat /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/experiments/osu_micro_benchmarks/osu_allreduce/osu_micro_benchmarks_osu_allreduce_test_mpi_2_2/affinity.mpi.out
Shows:
affinity test for 2 MPI ranks
rank 0 @ d26f4eea6a52: thread 0 -> core 0
rank 1 @ d26f4eea6a52: thread 0 -> core 0
Okay not super interesting because the broker was tricked into thinking it had 4 nodes, but this will certainly come in handy for more complex cases…
Epilogue: Gory Details, FAQ, and HPC Common Practices
What is Flux and why are you using it here?
The Flux Framework is the only workload manager on the El Capitan supercomputer. It is a hierarchical, highly-portable, security-aware workload manager and job scheduler that plays nice with cloud, containers, orchestrators, and more. We’re using it here because:
It works well in a container.
It can run under other workload managers such as Slurm, Spectrum LSF, etc., so you can run it easily on your own cluster.
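For example, on a Slurm cluster you can bring up a temporary Flux instance inside an allocation (a common pattern from the Flux documentation; adjust the node count to taste):
srun -N2 --pty flux start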
This seems complicated.
You’re right, it is, but portable, reproducible benchmarking across many different system types has a great deal of inherent complexity. Many have built test harnesses that either compromise on one of those features or slowly grow more complex over time in an unsustainable manner. Benchpark and Ramble take the complexity bull by the horns and use much of Spack’s design philosophy to pay the cost up front. The learning curve is admittedly somewhat steep, but the payoff is (hopefully) portability and reproducibility with a relatively stable set of interfaces.
How did you build this container? Which of these lessons can I apply to my own cluster?
The containerfile for this particular container is here. We based it on the Flux containers and picked EL10 because that’s a common OS for HPC with a relatively stable ABI, which is important when building so much from source. The two key tricks it demonstrates for HPC/AI practitioners are:
Manage affinity portably with a tool like mpibind.
Describe and build on system externals deterministically with Spack.
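As a small taste of that second trick, Spack can detect packages that are already installed on a system and register them as externals so they are reused rather than rebuilt. This is a sketch, not the exact recipe Benchpark uses; on a production cluster you would typically pin exact specs in a packages.yaml:
spack external find --not-buildable mpich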
More on how to do that second part for your own cluster is featured in Adding a System.