Tutorial: Local Containers
This tutorial will walk you through getting Benchpark running in your own local container. There are some known issues and bugs, but one advantage of working in a local container is that you can test basic benchmarks, then process and analyze their output, without having to deal with running on a real cluster.
Step 1: Getting Started with Container Runtimes
If you haven’t used containers on your local system before, allow us to recommend the Podman Desktop container runtime suite. It can be installed using the instructions from their official site or with Homebrew. It is free and quite robust. Its CLI is more or less the same as Docker’s, so if you are following along with Docker, the commands here should work equally well, but let us know if that is not the case.
Some Changes to the Default VM
For the purposes of this tutorial, I recommend setting at least 8 cores and about 16 GB of memory as follows (note that podman machine set takes the memory value in MiB):
podman machine stop
podman machine set --cpus=8 --memory 15258
podman machine start
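To double-check that the new settings took effect, you can inspect the machine. The Resources template fields below match current Podman releases, but the names may vary between versions:
podman machine inspect --format '{{.Resources.CPUs}} CPUs, {{.Resources.Memory}} MiB'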
Some Common Issues
If you are behind a firewall, you may need to tell your VM about the firewall’s self-signed certificate to avoid errors when pulling containers. Here is an example for macOS:
security find-certificate -a -p /Library/Keychains/System.keychain | \
podman machine ssh sudo tee /etc/pki/ca-trust/source/anchors/macos-system-certs.pem > /dev/null
and then you will need to restart the machine as above.
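For completeness, the full restart sequence looks like this (the update-ca-trust step assumes your VM uses the standard Fedora/EL trust store, which the default Podman machine image does):
podman machine ssh sudo update-ca-trust
podman machine stop
podman machine start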
Step 2: Playing with the Benchpark Container
Now, let’s pull and run a Benchpark Container:
podman pull ghcr.io/llnl/benchpark/benchpark-flux-el10
podman run -it benchpark-flux-el10
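If your container runtime does not resolve the short image name, use the fully qualified one:
podman run -it ghcr.io/llnl/benchpark/benchpark-flux-el10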
And now, if everything worked properly, you should be dropped into a shell that is already running Flux and Benchpark. The Flux broker has been tricked into thinking the container has 4 nodes with 8 cores each. mpibind has also been installed as a Flux plugin, and is managing “affinity” on the various “nodes.” A simple MPICH installation is also available.
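You can verify the fake topology for yourself with a standard Flux command; it should report 4 “nodes” with 8 cores each:
flux resource list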
Some simple Benchpark commands
Let’s run some simple Benchpark commands to poke around:
$ benchpark list systems
Systems - SYSTEM_DEFINITION CLUSTER/INSTANCE
aws-pcluster instance_type=[c6g.xlarge|c4.xlarge|hpc7a.48xlarge|hpc6a.48xlarge]
aws-tutorial instance_type=[c7i.48xlarge|c7i.metal-48xl|c7i.24xlarge|c7i.metal-24xl|c7i.12xlarge]
csc-lumi
cscs-daint
cscs-eiger
fluxtainer
generic-x86
jsc-juwels
lanl-rocinante cluster=[rocinante|tycho|crossroads]
lanl-venado cluster=[grace-hopper|grace-grace]
lbnl-perlmutter
llnl-cluster cluster=[magma|dane|rzgenie|ruby-deprecated|poodle-deprecated]
llnl-elcapitan cluster=[tioga|elcapitan|tuolumne]
llnl-matrix
llnl-sierra-deprecated
riken-fugaku
snl-eldorado
This will show you all the systems available. The one this container uses is called fluxtainer. Let’s get some more info on it:
$ benchpark info system fluxtainer
Hardware:
arm:
cpu_arch: arm64
x86:
cpu_arch: x86_64_v3
Maintainers:
nhanford
Variants:
name: timeout
default: 120
description: Set job timeout limit (in minutes). Has to be under the limit for selected 'queue'.
values: None
validator: <function variant.<locals>._always_true at 0x7faff31f1620>
multi: False
sticky: False
name: instance_type
default: x86
description: Target Architecture
values: ('arm', 'x86')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7faff125ae80>
multi: False
sticky: False
In the output, we can see that there is a variant called instance_type. Let’s
keep that in mind as we initialize the system. In my case, the architecture is arm,
but yours might be x86.
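If you are not sure which one applies to you, uname will tell you (aarch64 or arm64 means arm here; x86_64 means x86):
uname -m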
benchpark system init --dest=my-fluxtainer fluxtainer instance_type=arm
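If you are curious, you can peek at the system configuration this generated in the destination directory (the exact file layout may differ between Benchpark versions):
ls -R my-fluxtainer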
Okay now let’s look at some benchmarks:
$ benchpark list experiments
Experiments - BENCHMARK+PROGRAMMING_MODEL+SCALING
ad+[mpi]
amg2023+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
babelstream+[openmp|cuda|rocm]
branson+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
commbench+[cuda|rocm]
genesis+[openmp|mpi]
gpcnet+[mpi]
gromacs+[openmp|cuda|rocm|mpi]
hpcg+[openmp|mpi]+[strong|weak]
hpl+[openmp|mpi]+[strong|weak]
ior+[mpi]+[strong|weak]
kripke+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
laghos+[cuda|rocm|mpi]+[strong|weak|throughput]
lammps+[openmp|cuda|rocm|mpi]+[strong]
md-test+[mpi]+[strong]
osu-micro-benchmarks+[cuda|rocm|mpi]
phloem+[mpi]
py-scaffold+[cuda|rocm]+[strong|weak]
quicksilver+[openmp|mpi]+[strong|weak]
qws+[openmp|mpi]+[strong|weak|throughput]
raja-perf+[openmp|cuda|rocm|mpi]+[strong|weak|throughput]
remhos+[cuda|rocm|mpi]+[strong|weak|throughput]
salmon-tddft+[openmp|mpi]
smb+[mpi]
sparta-snl+[openmp|cuda|rocm|mpi]
stream+[mpi]
test
Let’s look at the osu-micro-benchmarks:
$ benchpark info experiment osu-micro-benchmarks
Maintainers:
nhanford
Ramble Name:
osu_micro_benchmarks
Spack Name:
osu_micro_benchmarks
Variants:
name: exec_mode
default: test
description: Execution mode
values: ('test', 'perf')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90cc0>
multi: False
sticky: False
name: affinity
default: none
description: Build and run the affinity package
values: ('none', 'on')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c909a0>
multi: False
sticky: False
name: hwloc
default: none
description: Get underlying infrastructure topology
values: ('none', 'on')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90a40>
multi: False
sticky: False
name: package_manager
default: spack
description: package manager to use
values: ('spack', 'environment-modules', 'user-managed', 'pip', 'spack-pip')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c907c0>
multi: False
sticky: False
name: append_path
default:
description: Append to environment PATH during experiment execution
values: None
validator: <function variant.<locals>._always_true at 0x7f4083c90b80>
multi: False
sticky: False
name: prepend_path
default:
description: Prepend to environment PATH during experiment execution
values: None
validator: <function variant.<locals>._always_true at 0x7f4083c90d60>
multi: False
sticky: False
name: n_repeats
default: 0
description: Number of experiment repetitions
values: None
validator: <function variant.<locals>._always_true at 0x7f4083c90f40>
multi: False
sticky: False
name: allocation
default: standard
description: Allocation modifier mode
values: ('standard', 'torchrun-hpc')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90540>
multi: False
sticky: False
name: mpi
default: True
description: Run with MPI
values: (True, False)
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c90220>
multi: False
sticky: False
name: rocm
default: False
description: Build and run with ROCm
values: (True, False)
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c934c0>
multi: False
sticky: False
name: cuda
default: False
description: Build and run with CUDA
values: (True, False)
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c93420>
multi: False
sticky: False
name: workload
default: osu_latency
description: workloads available
values: ('osu_bibw', 'osu_bw', 'osu_latency', 'osu_latency_mp', 'osu_latency_mt', 'osu_mbw_mr', 'osu_multi_lat', 'osu_allgather', 'osu_allreduce_persistent', 'osu_alltoallw', 'osu_bcast_persistent', 'osu_iallgather', 'osu_ialltoallw', 'osu_ineighbor_allgather', 'osu_ireduce', 'osu_neighbor_allgatherv', 'osu_reduce_persistent', 'osu_scatterv', 'osu_allgather_persistent', 'osu_alltoall', 'osu_alltoallw_persistent', 'osu_gather', 'osu_iallgatherv', 'osu_ibarrier', 'osu_ineighbor_allgatherv', 'osu_ireduce_scatter', 'osu_neighbor_alltoall', 'osu_reduce_scatter', 'osu_scatterv_persistent', 'osu_allgatherv', 'osu_alltoall_persistent', 'osu_barrier', 'osu_gather_persistent', 'osu_iallreduce', 'osu_ibcast', 'osu_ineighbor_alltoall', 'osu_iscatter', 'osu_neighbor_alltoallv', 'osu_reduce_scatter_persistent', 'osu_allgatherv_persistent', 'osu_alltoallv', 'osu_barrier_persistent', 'osu_gatherv', 'osu_ialltoall', 'osu_igather', 'osu_ineighbor_alltoallv', 'osu_iscatterv', 'osu_neighbor_alltoallw', 'osu_scatter', 'osu_allreduce', 'osu_alltoallv_persistent', 'osu_bcast', 'osu_gatherv_persistent', 'osu_ialltoallv', 'osu_igatherv', 'osu_ineighbor_alltoallw', 'osu_neighbor_allgather', 'osu_reduce', 'osu_scatter_persistent', 'osu_acc_latency', 'osu_cas_latency', 'osu_fop_latency', 'osu_get_acc_latency', 'osu_get_bw', 'osu_get_latency', 'osu_put_bibw', 'osu_put_bw', 'osu_put_latency', 'osu_hello', 'osu_init')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c93600>
multi: True
sticky: False
name: version
default: 7.5
description: app version
values: ('latest', '7.5')
validator: <function Variant.__init__.<locals>.<lambda> at 0x7f4083c93740>
multi: False
sticky: False
Okay that was a lot to take in. Most of it is variants yet again. We’ll discuss a
couple in detail: affinity will run the affinity test to show where different communicating
processes ended up within a node, be they MPI ranks or OpenMP threads, on CPU cores or
GPUs. This can be very helpful for debugging common parallel performance issues.
The other variant to look at is workload. This variant determines which of the actual micro-benchmarks get run. Okay, let’s get started with a simple example, but first some caveats:
Caveats and known issues
1. The OSU benchmarks are a work in progress, and we are still working out the details on scaling ranks at this stage, so we will stick with 2 ranks on 2 fake nodes.
2. Collectives can hang in this configuration as this multi-node trick we’re playing on the Flux broker is really more for testing broker throughput, scheduling algorithms, etc., not actual applications. We will demonstrate a more robust single-node configuration a bit later.
Okay now that we have that out of the way, let’s initialize, set up, and run an experiment:
benchpark experiment init my-fluxtainer osu-micro-benchmarks workload=osu_allreduce,osu_mbw_mr affinity=on
Now we get a message back telling us what to run next:
Run `benchpark setup my-fluxtainer/osu-micro-benchmarks <experiments_root>` to generate Ramble workspace
Let’s call the experiments_root wkp for now…
benchpark setup my-fluxtainer/osu-micro-benchmarks wkp
If you get the error
fatal: hardlink different from source at ...
Run:
rm -rf ~/.benchpark
benchpark bootstrap
and reattempt the above benchpark setup ... command. And yet again, Benchpark tells
us what to run next:
Clearing existing workspace /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks
Setting up configs for Ramble workspace /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/configs
Cloning packages to /home/fluxuser/benchpark/wkp/spack-packages
Cloning Spack to /home/fluxuser/benchpark/wkp/spack
Cloning Ramble to /home/fluxuser/benchpark/wkp/ramble
To complete the benchpark setup, do the following:
. /home/fluxuser/benchpark/wkp/setup.sh
Further steps are needed to build the experiments (ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace workspace setup) and run them (ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace on)
So let’s do that:
. /home/fluxuser/benchpark/wkp/setup.sh
ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace workspace setup
ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace on
And Ramble tells us it built the software:
==> Streaming details to log:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34.out
==> Setting up 2 out of 2 experiments:
==> Experiment #1 (1/2):
==> name: osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2
==> root experiment_index: 1
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34/osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2.out
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34.out
==> Experiment #2 (2/2):
==> name: osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2
==> root experiment_index: 2
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34/osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2.out
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/setup.2026-05-06_23.38.34.out
And Ramble and Flux tell us they ran the jobs:
==> Streaming details to log:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/execute.2026-05-06_23.43.29.out
==> Executing 2 out of 2 experiments:
==> Log files for experiments are stored in: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/execute.2026-05-06_23.43.29
==> Running executors...
ƒJMvePMyZ
ƒJMxjxM5q
So what just happened?
First, Benchpark copied Spack, its packages repository (with all the build recipes), and Ramble into a dedicated workspace for this experiment. This isolates each experiment from the others and makes it reproducible.
Second, Ramble built our benchmarks according to the instructions given at initialization, and also set up the batch scripts for the job scheduler (in this case, Flux).
Finally, Ramble executed the benchmarks using the Flux scheduler.
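By the way, the two strings printed at the end of the run (ƒJMvePMyZ and ƒJMxjxM5q) are Flux job IDs, and you can inspect them with standard Flux commands, for example:
flux jobs -a    # list all jobs, including completed ones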
Now let’s analyze the results:
ramble --workspace-dir /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace workspace analyze -f json
And Ramble tells us it performed the analysis:
==> Streaming details to log:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25.out
==> Analyzing 2 out of 2 experiments:
==> Experiment #1 (1/2):
==> name: osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2
==> root experiment_index: 1
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25/osu_micro_benchmarks.osu_allreduce.osu_micro_benchmarks_osu_allreduce_test_mpi_2_2.out
==> Invalidating experiment results cache: timestamp difference
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25.out
==> Experiment #2 (2/2):
==> name: osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2
==> root experiment_index: 2
==> log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25/osu_micro_benchmarks.osu_mbw_mr.osu_micro_benchmarks_osu_mbw_mr_test_mpi_2_2.out
==> Invalidating experiment results cache: timestamp difference
==> Returning to log file: /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/logs/analyze.2026-05-06_23.46.25.out
==> Results are written to:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.2026-05-06_23.46.26.json
==> Symlinks updated:
==> /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.latest.json
So let’s look at one:
cat /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.latest.json | jq
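If you just want a quick summary rather than the full dump, a jq filter along these lines works (the field names are assumptions based on Ramble’s results format, so adjust them to match what you actually see in the file):
jq '.experiments[] | {name, RAMBLE_STATUS}' /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/results/results.latest.json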
If you scroll around, you’ll see that Ramble captured quite a bit of information about our benchmarks, most importantly, the performance data.
Hey, remember we set affinity=on? Where did that end up? Let’s poke around this workspace and see:
cat /home/fluxuser/benchpark/wkp/my-fluxtainer/osu-micro-benchmarks/workspace/experiments/osu_micro_benchmarks/osu_allreduce/osu_micro_benchmarks_osu_allreduce_test_mpi_2_2/affinity.mpi.out
Shows:
affinity test for 2 MPI ranks
rank 0 @ d26f4eea6a52: thread 0 -> core 0
rank 1 @ d26f4eea6a52: thread 0 -> core 0
Okay not super interesting because the broker was tricked into thinking it had 4 nodes, but this will certainly come in handy for more complex cases…
Epilogue: Gory Details, FAQ, and HPC Common Practices
What is Flux and why are you using it here?
The Flux Framework is the only workload manager on the El Capitan supercomputer. It is a hierarchical, highly-portable, security-aware workload manager and job scheduler that plays nice with cloud, containers, orchestrators, and more. We’re using it here because:
It works well in a container.
It can run under other workload managers such as Slurm, Spectrum LSF, etc., so you can run it easily on your own cluster.
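For example, on a Slurm cluster you can bring up a temporary Flux instance inside an allocation (a common pattern from the Flux documentation; adjust the node count to taste):
srun -N2 --pty flux start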
This seems complicated.
You’re right, it is, but portable, reproducible benchmarking across many different system types has a great deal of inherent complexity. Many have built test harnesses that either compromise on one of those features or slowly grow more complex over time in an unsustainable manner. Benchpark and Ramble take the complexity bull by the horns and use much of Spack’s design philosophy to pay the cost up front. The learning curve is admittedly somewhat steep, but the payoff is (hopefully) portability and reproducibility with a relatively stable set of interfaces.
How did you build this container? Which of these lessons can I apply to my own cluster?
The containerfile for this particular container is here. We based it on the Flux containers and picked EL10 because that’s a common OS for HPC with a relatively stable ABI, which is important when building so much from source. The two key tricks it demonstrates for HPC/AI practitioners are:
Manage affinity portably with a tool like mpibind.
Describe and build on system externals deterministically with Spack.
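As a small taste of that second trick, Spack can detect packages that are already installed on a system and register them as externals so they are reused rather than rebuilt. This is a sketch, not the exact recipe Benchpark uses; on a production cluster you would typically pin exact specs in a packages.yaml:
spack external find --not-buildable mpich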
More on how to do that second part for your own cluster is featured in Adding a System.