MPI Profiling
================================

Caliper's built-in profiling recipes support MPI natively and automatically
aggregate performance data across all MPI ranks. In addition, Caliper provides
MPI-specific performance statistics, such as the time spent in MPI functions
and the number and size of messages.

MPI Function Profiling
--------------------------------

The `mpi-report` config recipe lists the number of invocations and the time
spent in each MPI function (min/max/avg across MPI ranks). It works similarly
to the mpiP profiling tool. The first row, which has an empty function name,
shows the time outside of MPI (i.e., the computation time of the program)::

    $ CALI_CONFIG=mpi-report srun -n 8 ./lulesh2.0
    Function                Count (min) Count (max) Time (min) Time (max) Time (avg)    Time %
                                    446         518   0.315387   0.415731   0.353483 83.299370
    MPI_Allreduce                    10          11   0.000281   0.068973   0.045038 10.613409
    MPI_Wait                        107         177   0.000795   0.032157   0.014788  3.484918
    MPI_Barrier                       2           2   0.000051   0.007671   0.005110  1.204122
    MPI_Isend                       107         177   0.002300   0.002799   0.002571  0.605904
    MPI_Waitall                      31          31   0.000482   0.001858   0.001149  0.270677
    MPI_Comm_split                    2           2   0.000176   0.001925   0.000999  0.235499
    MPI_Irecv                       107         177   0.000446   0.000767   0.000631  0.148605
    MPI_Bcast                         4           4   0.000054   0.000500   0.000436  0.102674
    MPI_Reduce                        1           1   0.000032   0.000296   0.000072  0.017057
    MPI_Comm_dup                      1           1   0.000038   0.000066   0.000052  0.012178
    MPI_Comm_free                     2           2   0.000012   0.000015   0.000013  0.003020
    MPI_Get_library_version          1           1   0.000007   0.000010   0.000008  0.001972
    MPI_Gather                        1           1   0.000020   0.000020   0.000020  0.000594

The `profile.mpi` option is available for most built-in profiling recipes, such
as `runtime-report` or `hatchet-region-profile`. It shows the time spent in MPI
functions within each Caliper region::

    $ CALI_CONFIG=runtime-report,profile.mpi srun -n 8 ./lulesh2.0
    Path                                          Min time/rank Max time/rank Avg time/rank    Time %
    main                                               0.007467      0.007918      0.007664  1.775109
      CommRecv                                         0.000036      0.000068      0.000044  0.010198
        MPI_Irecv                                      0.000036      0.000076      0.000044  0.010223
      CommSend                                         0.000045      0.000053      0.000047  0.010781
        MPI_Isend                                      0.000597      0.000621      0.000608  0.140909
        MPI_Waitall                                    0.000015      0.000020      0.000016  0.003809
      CommSBN                                          0.000035      0.000042      0.000037  0.008527
        MPI_Wait                                       0.000027      0.000286      0.000147  0.034127
      MPI_Barrier                                      0.000013      0.000104      0.000065  0.014967
      lulesh.cycle                                     0.000212      0.000252      0.000228  0.052810
        TimeIncrement                                  0.000085      0.000107      0.000091  0.021138
          MPI_Allreduce                                0.000210      0.071229      0.046590 10.791564
        LagrangeLeapFrog                               0.000263      0.000408      0.000320  0.074115
          LagrangeNodal                                0.004715      0.005330      0.005034  1.165980
            CalcForceForNodes                          0.000624      0.000747      0.000694  0.160774
              CommRecv                                 0.000242      0.000287      0.000265  0.061496
                MPI_Irecv                              0.000240      0.000332      0.000280  0.064767
              CalcVolumeForceForElems                  0.001827      0.002038      0.001919  0.444573
                IntegrateStressForElems                0.034616      0.038880      0.036624  8.483035
                CalcHourglassControlForElems           0.095108      0.102434      0.098921 22.912601
                  CalcFBHourglassForceForElems         0.062848      0.071650      0.067722 15.686204
              CommSend                                 0.000838      0.000949      0.000890  0.206152
                MPI_Isend                              0.000899      0.001126      0.000999  0.231382
                MPI_Waitall                            0.000140      0.000298      0.000216  0.049950
              CommSBN                                  0.000558      0.000692      0.000615  0.142361
                MPI_Wait                               0.000334      0.018442      0.008193  1.897615
              CommRecv                                 0.000044      0.000249      0.000152  0.035190
                MPI_Irecv                              0.000042      0.000281      0.000162  0.032884
              CommSend                                 0.000086      0.001056      0.000560  0.129665
                MPI_Isend                              0.000067      0.000525      0.000327  0.066173
                MPI_Waitall                            0.000043      0.002110      0.000808  0.187086
    (...)

Message statistics
................................

The `mpi.message.count` and `mpi.message.size` options show the number of
messages and the number of transferred bytes for both point-to-point and
collective communication operations.
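For example, the options can be added to the config string like any other
recipe option. This specific combination is an illustration, not one of the
examples above, and assumes a Caliper build with MPI support::

    $ CALI_CONFIG=mpi-report,mpi.message.count,mpi.message.size srun -n 8 ./lulesh2.0

The report then includes the message counts and transferred bytes alongside
the timing data.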
MPI Function filtering
................................

You can use the `mpi.include` and `mpi.exclude` options to explicitly select
or filter out the MPI operations to capture. This is more efficient than
filtering MPI functions with the name-based `include_regions` or
`exclude_regions` options. As an example, we can use `mpi.include` to only
measure `MPI_Allreduce`::

    $ CALI_CONFIG=runtime-report,profile.mpi,mpi.include=MPI_Allreduce srun -n 8 ./lulesh2.0
    Path                        Min time/rank Max time/rank Avg time/rank    Time %
    main                             0.007588      0.008178      0.007737  1.834651
      CommRecv                       0.000024      0.000034      0.000029  0.006834
      CommSend                       0.000551      0.000631      0.000594  0.140933
      CommSBN                        0.000019      0.000046      0.000031  0.007338
      lulesh.cycle                   0.000218      0.000254      0.000233  0.055250
        TimeIncrement                0.000088      0.000111      0.000098  0.023188
          MPI_Allreduce              0.000180      0.066283      0.042576 10.095752
        LagrangeLeapFrog             0.000278      0.000364      0.000324  0.076719
          LagrangeNodal              0.004838      0.005261      0.005013  1.188590
            CalcForceForNodes        0.000622      0.000839      0.000740  0.175528
              CommRecv               0.000056      0.000076      0.000065  0.015465
    (...)
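The `mpi.exclude` option works the other way around: the listed MPI functions
are dropped from measurement while all others are kept. As an illustrative
invocation (not taken from the original example set), the frequently called
`MPI_Wait` operations could be skipped like this::

    $ CALI_CONFIG=runtime-report,profile.mpi,mpi.exclude=MPI_Wait srun -n 8 ./lulesh2.0

The resulting report looks like the `profile.mpi` output above, minus the
`MPI_Wait` rows.

The region names in the reports above (`main`, `CommRecv`, `LagrangeLeapFrog`,
and so on) come from Caliper annotations in the LULESH source code. The
following minimal sketch, which is not taken from LULESH and assumes a Caliper
build with MPI support, illustrates how annotated regions and intercepted MPI
functions combine in the `profile.mpi` output (the region name "reduce" is
purely illustrative)::

    #include <caliper/cali.h>   // Caliper annotation macros (CALI_MARK_BEGIN/END)
    #include <mpi.h>

    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        double local = 1.0, global = 0.0;

        // With profile.mpi enabled, MPI calls made inside this region are
        // reported as children of "reduce" in the report tree.
        CALI_MARK_BEGIN("reduce");
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        CALI_MARK_END("reduce");

        MPI_Finalize();
        return 0;
    }

Running this program with `CALI_CONFIG=runtime-report,profile.mpi` would show
the `MPI_Allreduce` time nested under the `reduce` region.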