CUDA profiling
=======================================

Caliper can profile CUDA API functions and CUDA device-side activities like
kernel executions and memory copies. This requires Caliper to be built with
CUPTI support (`-DWITH_CUPTI=On`).

Profiling CUDA host-side API functions
---------------------------------------

The `profile.cuda` option is available for many ConfigManager configs like
`runtime-report`, `hatchet-region-profile`, and `spot`. It intercepts
host-side CUDA API functions like ``cudaMemcpy`` and ``cudaDeviceSynchronize``.
The example below shows `runtime-report` output with the `profile.cuda` option
enabled. We can see the time spent in CUDA functions like ``cudaMemcpy`` and
``cudaLaunchKernel``::

    $ CALI_CONFIG=runtime-report,profile.cuda lrun -n 4 ./tea_leaf
    Path                             Min time/rank Max time/rank Avg time/rank Time %
    timestep_loop                         0.000175      0.000791      0.000345 0.002076
      summary                             0.000129      0.000153      0.000140 0.000846
        cudaMemcpy                        0.000414      0.000422      0.000418 0.002516
        cudaLaunchKernel                  0.000247      0.000273      0.000256 0.001542
      total_solve                         0.000105      0.000689      0.000252 0.001516
        reset                             0.000160      0.000179      0.000169 0.001021
          internal_halo_update            0.000027      0.000029      0.000028 0.000167
          halo_update                     0.000028      0.000031      0.000029 0.000176
          halo_exchange                   0.000496      0.000812      0.000607 0.003654
            cudaMemcpy                    0.001766      0.001792      0.001779 0.010713
            cudaLaunchKernel              0.000418      0.000450      0.000430 0.002588
          cudaLaunchKernel                0.000097      0.000126      0.000106 0.000640
    ...

Profiling CUDA activities
---------------------------------------

The `cuda.gputime` option for `runtime-report` and `hatchet-region-profile`
measures and reports the time spent in GPU activities::

    CALI_CONFIG=runtime-report,cuda.gputime lrun -n 4 ./tea_leaf
    Path                           Min time/rank Max time/rank Avg time/rank Time %    Avg GPU Time/rank Min GPU Time/rank Max GPU Time/rank
    timestep_loop                      16.400791     16.402930     16.401392 99.340451         12.014107         11.990568         12.031210
      summary                           0.000743      0.000852      0.000772  0.004679          0.000420          0.000419          0.000421
      total_solve                      16.397178     16.398656     16.398176 99.320973         12.013687         11.990149         12.030790
        reset                           0.002924      0.004401      0.003682  0.022303          0.001463          0.001461          0.001465
          internal_halo_update          0.000031      0.000038      0.000033  0.000201
          halo_update                   0.000033      0.000042      0.000036  0.000220
          halo_exchange                 0.002589      0.004060      0.003304  0.020013          0.000223          0.000222          0.000223
        solve                          16.377495     16.377513     16.377504 99.195768         12.003737         11.980207         12.020827
          dot_product                   0.001003      0.001339      0.001151  0.006971
          internal_halo_update          0.087994      0.094034      0.089777  0.543764
          halo_update                   0.199130      0.199757      0.199502  1.208347
          halo_exchange                14.407030     14.427922     14.419671 87.337501          0.624949          0.619820          0.629013
        tea_init                        0.016205      0.016616      0.016506  0.099973          0.008487          0.008475          0.008501
          internal_halo_update          0.000096      0.000102      0.000100  0.000603
          halo_update                   0.000104      0.000113      0.000109  0.000660
          halo_exchange                 0.009489      0.010272      0.009914  0.060046          0.001013          0.001007          0.001022
      timestep                          0.000311      0.002516      0.001355  0.008205

The GPU Time/rank metrics denote the number of seconds spent in GPU
activities, such as kernel executions or memory copies. The GPU time values
are inclusive, i.e. they represent the total amount of GPU time launched from
a Caliper region and any region below it. Thus, the GPU time values for
``timestep_loop`` in the example above represent the GPU activity time for
the entire program.

Note that the `cuda.gputime` option is more expensive than regular host-side
region profiling because it uses the NVIDIA CUPTI API to trace GPU
activities. It is primarily intended for short, dedicated performance
profiling experiments.
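Caliper relies on source-code annotations to define the regions shown in
these reports. The following sketch is a minimal, hypothetical example (the
``axpy`` kernel and the ``axpy_step`` region name are made up for
illustration) of how a CUDA code can be annotated with Caliper's C/C++ macro
API so that, with the `cuda.gputime` option enabled, the kernel's GPU time is
attributed to a named region::

    #include <caliper/cali.h>

    #include <cuda_runtime.h>

    __global__ void axpy(int n, double a, const double* x, double* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        double *d_x = nullptr, *d_y = nullptr;
        cudaMalloc(&d_x, n * sizeof(double));
        cudaMalloc(&d_y, n * sizeof(double));

        CALI_MARK_BEGIN("axpy_step");
        axpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y); // asynchronous launch
        cudaDeviceSynchronize();
        CALI_MARK_END("axpy_step");

        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }

Running this program with ``CALI_CONFIG=runtime-report,cuda.gputime`` would
then list ``axpy_step`` with its host-side time and, in the GPU time columns,
the time the ``axpy`` kernel spent executing on the device.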
There are also dedicated configs for examining GPU activities: the
`cuda-activity-report` and `cuda-activity-profile` configs record the time
spent in CUDA activities (e.g. kernel executions or memory copies) on the
CUDA device. The GPU times are mapped to the Caliper regions that launched
those GPU activities. Here is example output for `cuda-activity-report`::

    $ CALI_CONFIG=cuda-activity-report lrun -n 4 ./tea_leaf
    Path                        Avg Host Time Max Host Time Avg GPU Time Max GPU Time       GPU %
    timestep_loop                   17.052183     17.053008    12.020218    12.037409   70.490786
      total_solve                   17.048899     17.049464    12.019799    12.036989   70.501909
        solve                       17.026991     17.027000    12.009868    12.027047   70.534294
          dot_product                0.004557      0.006057
            cudaMalloc               0.000050      0.000055
          internal_halo_update       0.051974      0.052401
          halo_update                0.110023      0.111015
          halo_exchange             15.154157     15.184635     0.621857     0.625657    4.103542
            cudaMemcpy              12.398854     12.422384     0.234947     0.235587    1.894913
            cudaLaunchKernel         1.426069      1.440035     0.386910     0.390767   27.131206
          cudaMemcpy                 0.497959      0.503879     0.003377     0.003433    0.678115
          cudaLaunchKernel           0.772027      0.793401    11.384634    11.402917 1474.641141

For each Caliper region, we now see the time spent on the CPU ("Avg/Max Host
Time") and the aggregate time on the GPU for activities launched from this
region ("Avg/Max GPU Time"). Note how the ``cudaLaunchKernel`` call in the
last row of the output has ~0.78 seconds of CPU time associated with it, but
~11.38 seconds of GPU time - these 11.38 seconds reflect the amount of GPU
activities launched asynchronously from this region. Most CPU time in the
example output is actually spent in ``cudaMemcpy`` under ``halo_exchange`` -
we can deduce that much of this is synchronization time spent waiting for the
GPU kernels to complete.

The "GPU %" metric shows the fraction of (wall-clock) CPU time in which there
are GPU activities, and represents the GPU utilization inside a Caliper
region. The percentage is based on the total inclusive CPU time and GPU
activities for the region and its sub-regions. The "GPU %" for the top-level
region (``timestep_loop``) represents the GPU utilization of the entire
program.

The definition of the reported metrics is as follows:

Avg Host Time
  Inclusive time (seconds) in the Caliper region on the Host (CPU).
  Typically the wall-clock time. Average value across MPI ranks.

Max Host Time
  Inclusive time (seconds) in the Caliper region on the Host (CPU).
  Typically the wall-clock time. Maximum value among all MPI ranks.

Avg GPU Time
  Inclusive total time (seconds) of activities executing on the GPU launched
  from the Caliper region. Average across MPI ranks.

Max GPU Time
  Inclusive total time (seconds) of activities executing on the GPU launched
  from the Caliper region. Maximum value among all MPI ranks.

GPU %
  Fraction of total inclusive GPU time vs. CPU time. Typically represents
  the GPU utilization in the Caliper region.
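As a worked example using the numbers from the report above, the GPU % value
is the inclusive GPU time divided by the inclusive host time: the last
``cudaLaunchKernel`` row shows 11.384634 seconds of GPU time against 0.772027
seconds of host time, i.e. 11.384634 / 0.772027 ≈ 14.746, or roughly 1474.6
percent, matching the reported value. Asynchronously launched kernels can
therefore push the GPU % of a region well above 100 when the region itself
spends little host time.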
Show GPU kernels
---------------------------------------

Use the `show_kernels` option in `cuda-activity-report` to distinguish
individual CUDA kernels::

    $ CALI_CONFIG=cuda-activity-report,show_kernels lrun -n 4 ./tea_leaf
    Path              Kernel                                                 Avg Host Time Max Host Time Avg GPU Time Max GPU Time    GPU %
    timestep_loop       |-                                                       17.068956     17.069917     0.239392     0.240725 1.402501
                        |- device_unpack_top_buffe~~le*, double*, int, int)                                  0.091051     0.092734
                        |- device_tea_leaf_ppcg_so~~ const*, double const*)                                  5.409844     5.419096
                        |- device_tea_leaf_ppcg_so~~t*, double const*, int)                                  5.316101     5.320777
                        |- device_pack_right_buffe~~le*, double*, int, int)                                  0.112455     0.113198
                        |- device_pack_top_buffer(~~le*, double*, int, int)                                  0.092634     0.092820
                        (...)
                        |- device_pack_bottom_buff~~le*, double*, int, int)                                  0.098929     0.099095
      summary           |-                                                        0.000881      0.000964     0.000010     0.000011 1.179024
                        |- device_field_summary_ke~~ble*, double*, double*)                                  0.000325     0.000326
                        |- void reduction(int, double*)                                                      0.000083     0.000084
        cudaMemcpy                                                                0.000437      0.000457     0.000010     0.000011 2.376874
        cudaLaunchKernel  |-                                                      0.000324      0.000392
                        |- device_field_summary_ke~~ble*, double*, double*)                                  0.000325     0.000326
                        |- void reduction(int, double*)                                                      0.000083     0.000084

We now see the GPU time spent in each kernel. The display is "inclusive",
that is, for each Caliper region, we see the kernels launched from this
region as well as from all regions below it. Under the top-level region
(``timestep_loop``), we see the total time spent in each CUDA kernel in the
program.

Profiling memory copies
---------------------------------------

The `cuda.memcpy` option shows the amount of data copied between CPU and GPU
memory::

    $ CALI_CONFIG=cuda-activity-report,cuda.memcpy lrun -n 4 ./tea_leaf
    Path                        Avg Host Time Max Host Time Avg GPU Time Max GPU Time     GPU % Copy CPU->GPU (avg) Copy CPU->GPU (max) Copy GPU->CPU (avg) Copy GPU->CPU (max)
    timestep_loop                   17.098644     17.100283    12.020492    12.040316 70.300846
      total_solve                   17.094882     17.095510    12.020072    12.039896 70.313864
        solve                       17.072250     17.072253    12.010138    12.029962 70.348891
          dot_product                0.001489      0.001684
            cudaMalloc               0.000050      0.000054
          internal_halo_update       0.052739      0.053484
          halo_update                0.110009      0.111008
          halo_exchange             15.194774     15.220273     0.622893     0.629519  4.099388
            cudaMemcpy              12.405958     12.430534     0.234917     0.235315  1.893578          902.517616          902.517616          902.469568          902.469568
            cudaLaunchKernel         1.431387      1.463419     0.387976     0.394445 27.104925
          cudaMemcpy                 0.493047      0.494699     0.003369     0.003393  0.683394            0.040320            0.040320            0.019232            0.019232

In the example, there were about 902 Megabytes copied from CPU to GPU memory
and back. There are four new metrics:

Copy CPU->GPU (avg)
  Data copied (Megabytes) from CPU to GPU memory. Average across MPI ranks.

Copy CPU->GPU (max)
  Data copied (Megabytes) from CPU to GPU memory. Maximum among MPI ranks.

Copy GPU->CPU (avg)
  Data copied (Megabytes) from GPU to CPU memory. Average across MPI ranks.

Copy GPU->CPU (max)
  Data copied (Megabytes) from GPU to CPU memory. Maximum among MPI ranks.

Profiles in JSON or CALI format
---------------------------------------

The `cuda-activity-profile` config records data similar to
`cuda-activity-report`, but writes it in machine-readable JSON or Caliper's
".cali" format for processing with `cali-query`. By default, it writes the
`json-split` format that can be read by Hatchet::

    $ CALI_CONFIG=cuda-activity-profile lrun -n 4 ./tea_leaf
    $ ls *.json
    cuda_profile.json

Other formats can be selected with the `output.format` option. Possible
values:

cali
  The Caliper profile/trace format for processing with `cali-query`.

hatchet
  JSON format readable by Hatchet.
  See :ref:`json-split-format`.

json
  Easy-to-parse list-of-dicts style JSON. See :ref:`json-format`.
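As with the other configs shown here, these options can also be set
programmatically when the application uses the ConfigManager API instead of
the ``CALI_CONFIG`` environment variable. The sketch below is a minimal,
hypothetical example: the output file name is made up, and it assumes the
common ConfigManager `output` option for naming the output file::

    #include <caliper/cali-manager.h>

    #include <iostream>

    int main()
    {
        cali::ConfigManager mgr;

        // Same configuration string one would otherwise pass via CALI_CONFIG.
        // The "output" option and the file name "tea_gpu.cali" are assumptions
        // made for this example.
        mgr.add("cuda-activity-profile,output.format=cali,output=tea_gpu.cali");

        if (mgr.error())
            std::cerr << "Caliper config error: " << mgr.error_msg() << std::endl;

        mgr.start();

        // ... application work: kernel launches, memory copies, MPI ...

        mgr.flush(); // write the profile for later processing with cali-query
        return 0;
    }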