Memory usage profiling
================================

Caliper provides basic functionality to profile memory usage on Linux
systems.

Heap allocation statistics
--------------------------------

The `alloc.stats` option reports heap memory allocation statistics. To do
so, Caliper intercepts all malloc and free operations, so enabling this
option can be somewhat expensive for programs that perform many small
memory allocations.

The output contains the following metrics:

Mem HWM
  The memory high-water mark in bytes, i.e. the maximum total amount of
  allocated memory observed while this region was active.

Alloc tMax
  Allocation tally maximum. This is the maximum total amount of memory
  that was allocated directly by this region. In contrast, the memory
  high-water mark includes memory allocated in any region.

Alloc count
  Number of individual memory allocations.

Min, Avg, Max Bytes/alloc
  Minimum, average, and maximum bytes allocated per allocation operation.

An example output looks like this: ::

    $ CALI_CONFIG=runtime-report,alloc.stats ./lulesh2.0
    Path                                        Time (E) Time (I) Time % (E) Time % (I)   Mem HWM Alloc tMax Alloc count Min Bytes/alloc Avg Bytes/alloc Max Bytes/alloc
    main                                        0.004867 0.100623   4.712664  97.423931   8750898    8750898          71               1          124779          864000
      lulesh.cycle                              0.000259 0.095755   0.250617  92.711267   8740194         96          10              96              96              96
        TimeIncrement                           0.000084 0.000084   0.081690   0.081690   8740194
        LagrangeLeapFrog                        0.000488 0.095412   0.472577  92.378960   8740194         96          20              96              96              96
          LagrangeNodal                         0.001119 0.060091   1.083415  58.180911   8740194
            CalcForceForNodes                   0.000490 0.058972   0.474139  57.097497   8740194        216           2              24             108             192
              CalcVolumeForceForElems           0.000710 0.058483   0.687355  56.623357   9604194     864000          40          216000          216000          216000
                IntegrateStressForElems         0.014498 0.014498  14.036874  14.036874  14788194    5184000          30         1728000         1728000         1728000
                CalcHourglassControlForElems    0.021265 0.043275  20.588633  41.899129  19972194   10368000          60         1728000         1728000         1728000
                  CalcFBHourglassForceForElems  0.022010 0.022010  21.310495  21.310495  25156194    5184000          30         1728000         1728000         1728000
          LagrangeElements                      0.000215 0.034296   0.207968  33.205399   8740194
            CalcLagrangeElements                0.000369 0.007101   0.357067   6.875254   9388194     648000          30          216000          216000          216000
              CalcKinematicsForElems            0.006732 0.006732   6.518187   6.518187   9388194
            CalcQForElems                       0.003196 0.005037   3.094518   4.877323  10165794    1425600          60          216000          237600          259200
              CalcMonotonicQForElems            0.001841 0.001841   1.782805   1.782805  10165794
            ApplyMaterialPropertiesForElems     0.000541 0.021942   0.524234  21.244854   8956194     216000          10          216000          216000          216000
              EvalEOSForElems                   0.007972 0.021401   7.718625  20.720620   9902034     945840        1540             984           19636           67560
                CalcEnergyForElems              0.013429 0.013429  13.001996  13.001996   9969594      67560         350             984           45988           67560

Note that `alloc.stats` *only* accounts for memory allocations made through
C heap allocation calls (malloc, calloc, realloc, free) within the active
profiling session. It does *not* account for memory allocated through other
means, including:

* Allocations made before the Caliper profiling session was initialized
* Objects allocated in static memory or on the stack
* Text segments (i.e., program code)
* GPU memory allocated through `cudaMalloc`, `hipMalloc`, etc.
* Heap allocations made through other means, such as direct `mmap` calls or
  CUDA/HIP host memory allocation calls, unless they fall back to
  malloc/free
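For illustration, here is a minimal annotation sketch (the region names,
allocation sizes, and build line are illustrative and not part of the LULESH
example above). Heap allocations made inside an annotated region are
attributed to that region when the program is run with
`CALI_CONFIG=runtime-report,alloc.stats`: ::

    // sketch.cpp -- minimal alloc.stats example (illustrative)
    #include <caliper/cali.h>

    #include <vector>

    int main()
    {
        CALI_CXX_MARK_FUNCTION;              // creates the "main" function region

        CALI_MARK_BEGIN("setup");
        std::vector<double> field(1000000); // ~8 MB heap allocation in "setup"
        CALI_MARK_END("setup");

        CALI_MARK_BEGIN("compute");
        std::vector<double> tmp(field);     // copy: another ~8 MB allocated in "compute"
        for (double& x : tmp)
            x *= 0.5;
        field.swap(tmp);                    // swap does not allocate
        CALI_MARK_END("compute");

        return 0;
    }

Compiled and linked against Caliper (e.g. `g++ sketch.cpp -lcaliper`; the
exact flags depend on the installation) and run with
`CALI_CONFIG=runtime-report,alloc.stats`, the report attributes the vector
allocations to the `setup` and `compute` regions. The exact byte counts
depend on the C++ standard library's allocation behavior.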
Memory page use statistics
--------------------------------

The `mem.pages` option provides high-water marks for total (VmSize),
resident set (VmRSS), and data segment (Data) memory usage in number of
pages, similar to the memory statistics reported by tools like `top`.

Unlike the `alloc.stats` option, this feature does not require intercepting
malloc/free calls, so it should induce less runtime overhead in programs
that perform many small memory allocations. It does, however, incur a
moderate per-region overhead for reading information from `/proc`.
Therefore, it may be advisable to limit profiling to high-level regions
(e.g. with `level=phase` if the target code distinguishes regular and phase
regions).

Also unlike the `alloc.stats` option, `mem.pages` captures all CPU-side
memory usage, including code segments, static memory, etc. (GPU memory is
not covered). However, according to the Linux documentation, the reported
numbers may not be entirely accurate.

The option adds `VmSize`, `VmRSS`, and `Data` metrics representing the
high-water mark in number of pages observed in the respective categories.
Note that in the example below the number of pages is highest in the
`CalcFBHourglassForceForElems` function, just as in the `alloc.stats`
output above: ::

    $ CALI_CONFIG=runtime-report,mem.pages ./lulesh2.0
    [...]
    Path                                        Time (E) Time (I) Time % (E) Time % (I)  VmSize  VmRSS   Data
    main                                        0.010168 9.979632   0.101883  99.997550   79862   6786  33651
      lulesh.cycle                              0.000085 9.969464   0.000855  99.895667   81740   7665  35529
        TimeIncrement                           0.000047 0.000047   0.000469   0.000469   81740   7665  35529
        LagrangeLeapFrog                        0.000839 9.969332   0.008402  99.894343   81740   7665  35529
          LagrangeNodal                         0.056944 0.239477   0.570587   2.399594   81740   7665  35529
            CalcForceForNodes                   0.017880 0.182533   0.179165   1.829007   81740   7665  35529
              CalcVolumeForceForElems           0.039597 0.164652   0.396773   1.649842   81740   7742  35529
                IntegrateStressForElems         0.051738 0.051738   0.518423   0.518423   81740   7742  35529
                CalcHourglassControlForElems    0.030218 0.073317   0.302787   0.734645   83849   9804  37638
                  CalcFBHourglassForceForElems  0.043099 0.043099   0.431859   0.431859   84271  10196  38060
          LagrangeElements                      0.012675 9.057428   0.127006  90.756916   81740   7665  35529
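The advice above about limiting measurements to high-level regions can be
implemented with phase annotations. The sketch below assumes a Caliper
version that provides the `CALI_MARK_PHASE_BEGIN` / `CALI_MARK_PHASE_END`
macros; the phase names and the placeholder work are illustrative: ::

    // phases.cpp -- sketch of phase-level annotation (illustrative)
    #include <caliper/cali.h>

    #include <vector>

    // Placeholder work so the sketch is self-contained; a real application
    // would call its own simulation and output routines here.
    static void do_step(std::vector<double>& data)
    {
        for (double& x : data)
            x += 1.0;
    }

    int main()
    {
        std::vector<double> data(1000000);

        // Coarse program phases. If profiling is restricted to phase regions
        // (e.g. via the level=phase option mentioned above), per-region
        // measurements such as the mem.pages /proc reads are taken only at
        // these phase boundaries.
        CALI_MARK_PHASE_BEGIN("simulation");
        for (int step = 0; step < 10; ++step)
            do_step(data);
        CALI_MARK_PHASE_END("simulation");

        CALI_MARK_PHASE_BEGIN("output");
        // ... write results ...
        CALI_MARK_PHASE_END("output");

        return 0;
    }

A possible invocation, assuming the `level` option is accepted in this
position of the config string and with an illustrative program name, would
be `CALI_CONFIG=runtime-report,mem.pages,level=phase ./phases`.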