AMD provides a native profiler named rocprof in its ROCm stack, which lets users trace HIP/HSA API calls or custom roctx-annotated ranges. To collect specific hardware metrics, list them in a file and pass that file to rocprof with the -i option. The supported hardware metrics (basic or derived) can be listed with the following commands [1]:

rocprof --list-basic
rocprof --list-derived

To obtain specific metrics for a HIP executable, just run:

rocprof -i ./prof_metrics.txt -d ./data -t ./tmp -o output/tcc.csv ./gpu-stream-hip
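
The contents of prof_metrics.txt are not shown above. The following Python sketch writes one plausible version and prints the matching command line. It assumes the rocprof input format in which each `pmc:` line names one group of counters. TCC_EA_RDREQ_sum is discussed later in this article, while TCC_HIT_sum and TCC_MISS_sum are assumed names that should be checked against the --list-basic and --list-derived output for the target GPU. The helper names write_metrics_file and build_command are hypothetical.

```python
import shlex
from pathlib import Path

# Counters to request. TCC_EA_RDREQ_sum appears in this article; the other two
# names are assumptions and may not exist on every GPU generation.
COUNTERS = ["TCC_EA_RDREQ_sum", "TCC_HIT_sum", "TCC_MISS_sum"]


def write_metrics_file(path, counters):
    """Write a rocprof input file containing a single 'pmc:' counter group."""
    text = "# counters collected together\n" + "pmc: " + " ".join(counters) + "\n"
    Path(path).write_text(text)
    return text


def build_command(metrics_file, app):
    """Assemble the rocprof invocation with the same options as in the article."""
    return ["rocprof", "-i", metrics_file, "-d", "./data", "-t", "./tmp",
            "-o", "output/tcc.csv", app]


if __name__ == "__main__":
    write_metrics_file("prof_metrics.txt", COUNTERS)
    # Print the command instead of running it, so the sketch works without a GPU.
    print(shlex.join(build_command("./prof_metrics.txt", "./gpu-stream-hip")))
```

The script only prints the command. On a machine with ROCm installed, the printed line can be pasted into a shell or passed to subprocess.run.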

However, the list contains a number of AMD GPU concepts and their abbreviations, which can confuse profiler users [2]. This article records some of the primary terms that appear in the profiler metrics.

Terms

  • TCC: texture channel cache, i.e., the L2 cache or last-level cache (LLC) in AMD GPUs
  • TCP: texture cache private, i.e., the L0 cache in RDNA or the L1 cache in GCN/CDNA
  • EA: the interconnect between L2 and HBM (NoC) [3]
  • SQ: sequencer, i.e., the hardware dispatcher [4], which is responsible for issuing instructions
  • TA: texture address block, which determines the effective addresses of load/store instructions for later coalescing [4]

Basic and derived metrics

In rocprof, there are two kinds of metrics: basic and derived. Basic metrics are read directly from the hardware performance monitor counters, while derived metrics are computed from arithmetic expressions over several basic metrics. Some components, such as LLC slices and HBM channels, are replicated across the GPU. For metrics related to them, rocprof can report either a separate value (for a single instance) or an aggregated value (over all instances). For instance, one can obtain the read requests from LLC slice 0 to the off-chip NoC with TCC_EA_RDREQ0 (there are 32 LLC slices in total on MI100, so rocprof lists this counter as TCC_EA_RDREQ[0-31]), or the aggregated (sum or average) value, e.g. TCC_EA_RDREQ_sum.
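
To make the relationship between per-slice, aggregated and derived values concrete, here is a minimal Python sketch. It sums per-slice columns into one total per kernel and then evaluates a simple derived expression. The CSV layout, the bracketed column names and all numbers are made up for illustration, and only four slices are shown. The hit-rate formula is a hypothetical example of a derived metric, not a definition taken from rocprof.

```python
import csv
import io
import re

# Hypothetical excerpt of a profiler CSV: one row per kernel, one column per
# LLC slice. Real output has more columns and may name them differently.
SAMPLE = """KernelName,TCC_EA_RDREQ[0],TCC_EA_RDREQ[1],TCC_EA_RDREQ[2],TCC_EA_RDREQ[3]
copy,100,110,90,100
triad,200,210,190,200
"""


def aggregate(row, counter):
    """Sum every column of the form counter[i] in one CSV row."""
    pattern = re.compile(re.escape(counter) + r"\[\d+\]$")
    return sum(int(value) for name, value in row.items() if pattern.match(name))


def hit_rate_percent(hits, misses):
    """Example derived metric: an arithmetic expression over two basic counters."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
totals = {row["KernelName"]: aggregate(row, "TCC_EA_RDREQ") for row in rows}
print(totals)                      # {'copy': 400, 'triad': 800}
print(hit_rate_percent(300, 100))  # 75.0
```

Requesting TCC_EA_RDREQ_sum directly avoids this post-processing. The per-slice form is mainly useful for checking whether traffic is spread evenly across slices.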

Epilog

Additionally, AMD Research maintains a project called Omnitrace, which collects combined CPU and GPU profiles for parallel applications.

Footnotes

  1. https://github.com/ROCm/rocprofiler/blob/amd-master/test/tool/metrics.xml

  2. https://www.coelacanth-dream.com/posts/2020/01/14/amdgpu-abbreviation

  3. https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html

  4. https://github.com/ROCm/rocprofiler/issues/85#issuecomment-1089235070