MPI Profiling
Caliper’s built-in profiling recipes support MPI natively and automatically aggregate performance data across all MPI ranks. In addition, Caliper can report MPI-specific performance statistics, such as the time spent in individual MPI functions and the number and size of messages exchanged.
MPI Function Profiling
The mpi-report config recipe lists, for each MPI function, the number of invocations and the time spent in it (min/max/avg across MPI ranks). It works similarly to the mpiP profiling tool. The first row shows the time spent outside of MPI (i.e., the computation time of the program):
$ CALI_CONFIG=mpi-report srun -n 8 ./lulesh2.0
Function                Count (min) Count (max) Time (min) Time (max) Time (avg) Time %
                                446         518   0.315387   0.415731   0.353483 83.299370
MPI_Allreduce                    10          11   0.000281   0.068973   0.045038 10.613409
MPI_Wait                        107         177   0.000795   0.032157   0.014788  3.484918
MPI_Barrier                       2           2   0.000051   0.007671   0.005110  1.204122
MPI_Isend                       107         177   0.002300   0.002799   0.002571  0.605904
MPI_Waitall                      31          31   0.000482   0.001858   0.001149  0.270677
MPI_Comm_split                    2           2   0.000176   0.001925   0.000999  0.235499
MPI_Irecv                       107         177   0.000446   0.000767   0.000631  0.148605
MPI_Bcast                         4           4   0.000054   0.000500   0.000436  0.102674
MPI_Reduce                        1           1   0.000032   0.000296   0.000072  0.017057
MPI_Comm_dup                      1           1   0.000038   0.000066   0.000052  0.012178
MPI_Comm_free                     2           2   0.000012   0.000015   0.000013  0.003020
MPI_Get_library_version           1           1   0.000007   0.000010   0.000008  0.001972
MPI_Gather                        1           1   0.000020   0.000020   0.000020  0.000594
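Like other built-in recipes, mpi-report also accepts the common output option, which redirects the report from the terminal to a file. A minimal sketch, assuming we want the report in a file named mpi_report.txt:
$ CALI_CONFIG=mpi-report,output=mpi_report.txt srun -n 8 ./lulesh2.0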
The profile.mpi option is available for most built-in profiling recipes, such as runtime-report or hatchet-region-profile. It shows the time spent in MPI functions within each Caliper region:
$ CALI_CONFIG=runtime-report,profile.mpi srun -n 8 ./lulesh2.0
Path                                        Min time/rank Max time/rank Avg time/rank Time %
main                                             0.007467      0.007918      0.007664  1.775109
  CommRecv                                       0.000036      0.000068      0.000044  0.010198
    MPI_Irecv                                    0.000036      0.000076      0.000044  0.010223
  CommSend                                       0.000045      0.000053      0.000047  0.010781
    MPI_Isend                                    0.000597      0.000621      0.000608  0.140909
    MPI_Waitall                                  0.000015      0.000020      0.000016  0.003809
  CommSBN                                        0.000035      0.000042      0.000037  0.008527
    MPI_Wait                                     0.000027      0.000286      0.000147  0.034127
  MPI_Barrier                                    0.000013      0.000104      0.000065  0.014967
  lulesh.cycle                                   0.000212      0.000252      0.000228  0.052810
    TimeIncrement                                0.000085      0.000107      0.000091  0.021138
      MPI_Allreduce                              0.000210      0.071229      0.046590 10.791564
    LagrangeLeapFrog                             0.000263      0.000408      0.000320  0.074115
      LagrangeNodal                              0.004715      0.005330      0.005034  1.165980
        CalcForceForNodes                        0.000624      0.000747      0.000694  0.160774
          CommRecv                               0.000242      0.000287      0.000265  0.061496
            MPI_Irecv                            0.000240      0.000332      0.000280  0.064767
          CalcVolumeForceForElems                0.001827      0.002038      0.001919  0.444573
            IntegrateStressForElems              0.034616      0.038880      0.036624  8.483035
            CalcHourglassControlForElems         0.095108      0.102434      0.098921 22.912601
              CalcFBHourglassForceForElems       0.062848      0.071650      0.067722 15.686204
          CommSend                               0.000838      0.000949      0.000890  0.206152
            MPI_Isend                            0.000899      0.001126      0.000999  0.231382
            MPI_Waitall                          0.000140      0.000298      0.000216  0.049950
          CommSBN                                0.000558      0.000692      0.000615  0.142361
            MPI_Wait                             0.000334      0.018442      0.008193  1.897615
        CommRecv                                 0.000044      0.000249      0.000152  0.035190
          MPI_Irecv                              0.000042      0.000281      0.000162  0.032884
        CommSend                                 0.000086      0.001056      0.000560  0.129665
          MPI_Isend                              0.000067      0.000525      0.000327  0.066173
          MPI_Waitall                            0.000043      0.002110      0.000808  0.187086
(...)
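The same per-region MPI breakdown can also be recorded in machine-readable form by adding profile.mpi to the hatchet-region-profile recipe mentioned above, which writes a JSON profile for analysis with Hatchet. A sketch following the examples above:
$ CALI_CONFIG=hatchet-region-profile,profile.mpi srun -n 8 ./lulesh2.0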
Message statistics
The mpi.message.count and mpi.message.size options show the number of messages and transferred bytes for both point-to-point and collective communication operations.
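Both options can be added to an MPI-enabled recipe alongside profile.mpi. A minimal sketch (the exact columns that appear in the report depend on the Caliper version):
$ CALI_CONFIG=runtime-report,profile.mpi,mpi.message.count,mpi.message.size srun -n 8 ./lulesh2.0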
MPI Function filtering
You can use the mpi.include and mpi.exclude options to explicitly select or exclude the MPI operations to capture. This is more efficient than filtering MPI functions with the name-based include_regions or exclude_regions options. As an example, we can use mpi.include to measure only MPI_Allreduce:
$ CALI_CONFIG=runtime-report,profile.mpi,mpi.include=MPI_Allreduce srun -n 8 ./lulesh2.0
Path                                        Min time/rank Max time/rank Avg time/rank Time %
main                                             0.007588      0.008178      0.007737  1.834651
  CommRecv                                       0.000024      0.000034      0.000029  0.006834
  CommSend                                       0.000551      0.000631      0.000594  0.140933
  CommSBN                                        0.000019      0.000046      0.000031  0.007338
  lulesh.cycle                                   0.000218      0.000254      0.000233  0.055250
    TimeIncrement                                0.000088      0.000111      0.000098  0.023188
      MPI_Allreduce                              0.000180      0.066283      0.042576 10.095752
    LagrangeLeapFrog                             0.000278      0.000364      0.000324  0.076719
      LagrangeNodal                              0.004838      0.005261      0.005013  1.188590
        CalcForceForNodes                        0.000622      0.000839      0.000740  0.175528
          CommRecv                               0.000056      0.000076      0.000065  0.015465
(...)
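Conversely, mpi.exclude captures all MPI functions except the listed ones. A sketch following the same pattern, assuming we want to skip measuring MPI_Barrier:
$ CALI_CONFIG=runtime-report,profile.mpi,mpi.exclude=MPI_Barrier srun -n 8 ./lulesh2.0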