GPU profiling
Caliper can profile CUDA and ROCm API functions and device-side activities like kernel executions and memory copies. This requires Caliper to be built with CUpti support (-DWITH_CUPTI=On, for CUDA devices) or roctracer support (-DWITH_ROCTRACER=On, for AMD devices).
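For example, assuming a standard out-of-source CMake build directory, GPU support can be enabled at configure time like this (all other build options omitted):
$ cmake -DWITH_CUPTI=On ..        # NVIDIA / CUDA devices
$ cmake -DWITH_ROCTRACER=On ..    # AMD / ROCm devices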
Profiling host-side API functions
Use the profile.cuda or profile.hip options to report the time spent in host-side CUDA and HIP API functions, respectively. They are available for many ConfigManager configs like runtime-report, hatchet-region-profile, and spot. They intercept host-side CUDA API functions like cudaMemcpy and cudaDeviceSynchronize.
The example below shows runtime-report output with the profile.cuda option enabled. We can see the time spent in CUDA functions like cudaMemcpy and cudaLaunchKernel:
$ CALI_CONFIG=runtime-report,profile.cuda lrun -n 4 ./tea_leaf
Path Min time/rank Max time/rank Avg time/rank Time %
timestep_loop 0.000175 0.000791 0.000345 0.002076
summary 0.000129 0.000153 0.000140 0.000846
cudaMemcpy 0.000414 0.000422 0.000418 0.002516
cudaLaunchKernel 0.000247 0.000273 0.000256 0.001542
total_solve 0.000105 0.000689 0.000252 0.001516
reset 0.000160 0.000179 0.000169 0.001021
internal_halo_update 0.000027 0.000029 0.000028 0.000167
halo_update 0.000028 0.000031 0.000029 0.000176
halo_exchange 0.000496 0.000812 0.000607 0.003654
cudaMemcpy 0.001766 0.001792 0.001779 0.010713
cudaLaunchKernel 0.000418 0.000450 0.000430 0.002588
cudaLaunchKernel 0.000097 0.000126 0.000106 0.000640
...
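The same options can also be enabled programmatically through Caliper’s ConfigManager API instead of the CALI_CONFIG environment variable. A minimal sketch, assuming the usual Caliper headers are available; the region name and program structure are placeholders:

#include <caliper/cali.h>
#include <caliper/cali-manager.h>

#include <iostream>

int main(int argc, char* argv[])
{
    cali::ConfigManager mgr;
    // Same config string one would pass in CALI_CONFIG
    mgr.add("runtime-report,profile.cuda");
    if (mgr.error())
        std::cerr << "Caliper config error: " << mgr.error_msg() << std::endl;
    mgr.start();

    CALI_MARK_BEGIN("timestep_loop");
    // ... application work, including host-side CUDA API calls ...
    CALI_MARK_END("timestep_loop");

    mgr.flush(); // write the report before exiting
    return 0;
}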
Profiling device-side activities
Use the cuda.gputime (for NVIDIA) or rocm.gputime (for AMD) options to measure and report the time spent in GPU activities, such as kernels or memory copies:
$ CALI_CONFIG=runtime-report,cuda.gputime lrun -n 4 ./tea_leaf
Path Min time/rank Max time/rank Avg time/rank Time % Avg GPU Time/rank Min GPU Time/rank Max GPU Time/rank
timestep_loop 16.400791 16.402930 16.401392 99.340451 12.014107 11.990568 12.031210
summary 0.000743 0.000852 0.000772 0.004679 0.000420 0.000419 0.000421
total_solve 16.397178 16.398656 16.398176 99.320973 12.013687 11.990149 12.030790
reset 0.002924 0.004401 0.003682 0.022303 0.001463 0.001461 0.001465
internal_halo_update 0.000031 0.000038 0.000033 0.000201
halo_update 0.000033 0.000042 0.000036 0.000220
halo_exchange 0.002589 0.004060 0.003304 0.020013 0.000223 0.000222 0.000223
solve 16.377495 16.377513 16.377504 99.195768 12.003737 11.980207 12.020827
dot_product 0.001003 0.001339 0.001151 0.006971
internal_halo_update 0.087994 0.094034 0.089777 0.543764
halo_update 0.199130 0.199757 0.199502 1.208347
halo_exchange 14.407030 14.427922 14.419671 87.337501 0.624949 0.619820 0.629013
tea_init 0.016205 0.016616 0.016506 0.099973 0.008487 0.008475 0.008501
internal_halo_update 0.000096 0.000102 0.000100 0.000603
halo_update 0.000104 0.000113 0.000109 0.000660
halo_exchange 0.009489 0.010272 0.009914 0.060046 0.001013 0.001007 0.001022
timestep 0.000311 0.002516 0.001355 0.008205
The GPU Time/rank metrics denote the number of seconds spent in GPU activities, such as kernel execution or memory copies. The GPU time values are inclusive, i.e. they represent the total GPU time of activities launched from a Caliper region and any region below it. Thus, the GPU time values for “timestep_loop” in the example above represent the GPU activity time for the entire program.
Note that the cuda.gputime and rocm.gputime options are more expensive than regular host-side region profiling because they use tracing APIs to record each GPU activity. They are primarily intended for short, dedicated performance profiling experiments.
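GPU activity times are attributed to the Caliper region that is active on the host when the activity is launched. The sketch below, with a hypothetical kernel and launch configuration, shows how annotated regions around a kernel launch let cuda.gputime attribute the device time:

#include <caliper/cali.h>
#include <cuda_runtime.h>

__global__ void scale(double* x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

void run_scale(double* d_x, double a, int n)
{
    CALI_CXX_MARK_FUNCTION; // creates region "run_scale"

    CALI_MARK_BEGIN("scale_kernel");
    scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
    CALI_MARK_END("scale_kernel");

    // The kernel's device time is reported under "scale_kernel" and,
    // inclusively, under "run_scale".
    cudaDeviceSynchronize();
}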
There are also dedicated configs for examining GPU activities: the cuda-activity-report, cuda-activity-profile, rocm-activity-report, and rocm-activity-profile configs record the time spent in GPU activities (e.g. kernel executions or memory copies) on the device. The GPU times are mapped to the Caliper regions that launched those GPU activities. The …-report configs print a human-readable table, while the …-profile configs write a machine-readable .json or .cali file. Here is example output for cuda-activity-report:
$ CALI_CONFIG=cuda-activity-report lrun -n 4 ./tea_leaf
Path Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU %
timestep_loop 17.052183 17.053008 12.020218 12.037409 70.490786
total_solve 17.048899 17.049464 12.019799 12.036989 70.501909
solve 17.026991 17.027000 12.009868 12.027047 70.534294
dot_product 0.004557 0.006057
cudaMalloc 0.000050 0.000055
internal_halo_update 0.051974 0.052401
halo_update 0.110023 0.111015
halo_exchange 15.154157 15.184635 0.621857 0.625657 4.103542
cudaMemcpy 12.398854 12.422384 0.234947 0.235587 1.894913
cudaLaunchKernel 1.426069 1.440035 0.386910 0.390767 27.131206
cudaMemcpy 0.497959 0.503879 0.003377 0.003433 0.678115
cudaLaunchKernel 0.772027 0.793401 11.384634 11.402917 1474.641141
For each Caliper region, we now see the time spent on the CPU (“Avg/Max Host Time”) and the aggregate time on the GPU for activities launched from this region (“Avg/Max GPU Time”). Note how the cudaLaunchKernel call in the last row of the output has ~0.78 seconds of CPU time associated with it, but ~11.38 seconds of GPU time: these 11.38 seconds reflect the GPU activities launched asynchronously from this region. Most CPU time in the example output is actually spent in cudaMemcpy under halo_exchange; we can deduce that much of it is synchronization time spent waiting until the GPU kernels are complete.
The “GPU %” metric shows the fraction of (wall-clock) CPU time in which there are GPU activities, and represents GPU utilization inside a Caliper region. The percentage is based on the total inclusive CPU time and GPU activities for the region and its sub-regions. The “GPU %” for the top-level region (timestep_loop) represents the GPU utilization of the entire program.
The definitions of the reported metrics are as follows:
- Avg Host Time
Inclusive time (seconds) in the Caliper region on the Host (CPU). Typically the wall-clock time. Average value across MPI ranks.
- Max Host Time
Inclusive time (seconds) in the Caliper region on the Host (CPU). Typically the wall-clock time. Maximum value among all MPI ranks.
- Avg GPU Time
Inclusive total time (seconds) of activities executing on the GPU launched from the Caliper region. Average across MPI ranks.
- Max GPU Time
Inclusive total time (seconds) of activities executing on the GPU launched from the Caliper region. Maximum value among all MPI ranks.
- GPU %
Fraction of total inclusive GPU time vs. CPU time. Typically represents the GPU utilization in the Caliper region.
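As a worked example from the report above: for timestep_loop, the average GPU time is about 12.02 seconds and the average host time is about 17.05 seconds, so GPU % is roughly 12.02 / 17.05 ≈ 70.5, which matches the reported value of about 70.49.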
Show GPU kernels
Use the show_kernels option in cuda-activity-report to distinguish individual CUDA kernels:
$ CALI_CONFIG=cuda-activity-report,show_kernels lrun -n 4 ./tea_leaf
Path Kernel Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU %
timestep_loop
|- 17.068956 17.069917 0.239392 0.240725 1.402501
|- device_unpack_top_buffe~~le*, double*, int, int) 0.091051 0.092734
|- device_tea_leaf_ppcg_so~~ const*, double const*) 5.409844 5.419096
|- device_tea_leaf_ppcg_so~~t*, double const*, int) 5.316101 5.320777
|- device_pack_right_buffe~~le*, double*, int, int) 0.112455 0.113198
|- device_pack_top_buffer(~~le*, double*, int, int) 0.092634 0.092820
(...)
|- device_pack_bottom_buff~~le*, double*, int, int) 0.098929 0.099095
summary
|- 0.000881 0.000964 0.000010 0.000011 1.179024
|- device_field_summary_ke~~ble*, double*, double*) 0.000325 0.000326
|- void reduction<double, ~~N_TYPE)0>(int, double*) 0.000083 0.000084
cudaMemcpy 0.000437 0.000457 0.000010 0.000011 2.376874
cudaLaunchKernel
|- 0.000324 0.000392
|- device_field_summary_ke~~ble*, double*, double*) 0.000325 0.000326
|- void reduction<double, ~~N_TYPE)0>(int, double*) 0.000083 0.000084
We now see the GPU time in each kernel. The display is “inclusive”, that is, for each Caliper region we see kernels launched from this region as well as from all regions below it. Under the top-level region (timestep_loop), we see the total time spent in each CUDA kernel in the program.
Profiling memory copies
The cuda.memcpy option shows the amount of data copied between CPU and GPU memory:
$ CALI_CONFIG=cuda-activity-report,cuda.memcpy lrun -n 4 ./tea_leaf
Path Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU % Copy CPU->GPU (avg) Copy CPU->GPU (max) Copy GPU->CPU (avg) Copy GPU->CPU (max)
timestep_loop 17.098644 17.100283 12.020492 12.040316 70.300846
total_solve 17.094882 17.095510 12.020072 12.039896 70.313864
solve 17.072250 17.072253 12.010138 12.029962 70.348891
dot_product 0.001489 0.001684
cudaMalloc 0.000050 0.000054
internal_halo_update 0.052739 0.053484
halo_update 0.110009 0.111008
halo_exchange 15.194774 15.220273 0.622893 0.629519 4.099388
cudaMemcpy 12.405958 12.430534 0.234917 0.235315 1.893578 902.517616 902.517616 902.469568 902.469568
cudaLaunchKernel 1.431387 1.463419 0.387976 0.394445 27.104925
cudaMemcpy 0.493047 0.494699 0.003369 0.003393 0.683394 0.040320 0.040320 0.019232 0.019232
In the example, there were about 902 Megabytes copied from CPU to GPU memory and back. There are four new metrics:
- Copy CPU->GPU (avg)
Data copied (Megabytes) from CPU to GPU memory. Average across MPI ranks.
- Copy CPU->GPU (max)
Data copied (Megabytes) from CPU to GPU memory. Maximum among MPI ranks.
- Copy GPU->CPU (avg)
Data copied (Megabytes) from GPU to CPU memory. Average across MPI ranks.
- Copy GPU->CPU (max)
Data copied (Megabytes) from GPU to CPU memory. Maximum among MPI ranks.
Profiles in JSON or CALI format
The cuda-activity-profile and rocm-activity-profile configs record the same GPU activity data as the corresponding …-report configs, but write it in machine-readable JSON or Caliper’s “.cali” format for processing with cali-query. By default, they write the json-split format that can be read by Hatchet:
$ CALI_CONFIG=cuda-activity-profile lrun -n 4 ./tea_leaf
$ ls *.json
cuda_profile.json
Other formats can be selected with the output.format option. Possible values:
- cali
The Caliper profile/trace format for processing with cali-query.
- hatchet
JSON format readable by Hatchet. See Json-split.
- json
Easy-to-parse list-of-dicts style JSON. See Json.
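For example, a .cali file for later processing with cali-query can be requested by adding the output.format option to the config string:
$ CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 4 ./tea_leaf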