AMD provides a native profiler named rocprof in its ROCm stack, which lets users trace HIP/HSA calls or custom roctx-annotated profiling ranges.
To collect the desired hardware metrics, list them in a file and pass that file to rocprof via the -i option.
The supported hardware metrics (basic or derived) can be listed with rocprof --list-basic and rocprof --list-derived [1].
To collect specific metrics for a HIP executable, run rocprof with -i followed by the input file and then the executable.
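As a rough sketch of the input-file workflow, the Python snippet below writes a metrics file and assembles the matching rocprof command line without running it. The file names (input.txt, results.csv), the executable path (./my_hip_app) and the helper functions are hypothetical; the single "pmc:" line and the -o option reflect the commonly documented rocprof usage, and the counter names are only illustrative, so check them against the metric list on your own system.

```python
import shlex
from pathlib import Path


def write_input_file(path, metrics):
    """Write a rocprof input file with a single 'pmc:' line naming the counters."""
    Path(path).write_text("pmc: " + " ".join(metrics) + "\n")
    return path


def build_command(input_file, executable, output_csv="results.csv"):
    """Assemble (but do not execute) the rocprof command line as an argument list."""
    return ["rocprof", "-i", str(input_file), "-o", str(output_csv), str(executable)]


if __name__ == "__main__":
    # Illustrative counters; availability depends on the GPU and ROCm version.
    inp = write_input_file("input.txt", ["TCC_EA_RDREQ_sum", "SQ_WAVES"])
    # Hypothetical HIP executable name.
    print(shlex.join(build_command(inp, "./my_hip_app")))
```

Keeping the command as an argument list makes it easy to hand to subprocess.run once rocprof is actually available on the machine.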
However, the list contains a number of AMD GPU concepts and abbreviations that can confuse profiler users [2]. This article records some of the primary terms that appear in the profiler metrics.
Terms
TCC
: texture cache per channel, i.e., the L2 cache (LLC) in AMD GPUs
TCP
: texture cache per pipe, i.e., the L0 cache in RDNA or the L1 cache in GCN/CDNA
EA
: the interconnect between L2 and HBM (NoC) [3]
SQ
: sequencer, i.e., the hardware dispatcher [4], which is responsible for issuing instructions
TA
: texture address block, used to determine the effective addresses of load/store instructions for later coalescing [4]
Basic and derived metrics
In rocprof, there are two kinds of metrics: basic and derived. Basic metrics are read directly from hardware performance monitor counters, while derived metrics are computed from arithmetic expressions over several basic metrics.
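To make the basic/derived distinction concrete, here is a minimal sketch that computes a hit-rate style derived value from two basic counter readings. The function name, the counter names used as dictionary keys and the sample values are assumptions chosen for illustration; the expressions rocprof actually uses are defined by the tool itself.

```python
def l2_hit_rate(basic):
    """Derived metric: percentage of L2 accesses that hit, from two basic counters."""
    hits = basic["TCC_HIT_sum"]      # assumed basic counter name
    misses = basic["TCC_MISS_sum"]   # assumed basic counter name
    total = hits + misses
    # Guard against an empty measurement to avoid division by zero.
    return 100.0 * hits / total if total else 0.0


# Made-up readings, not measured data.
sample = {"TCC_HIT_sum": 750, "TCC_MISS_sum": 250}
print(l2_hit_rate(sample))  # 75.0
```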
Some components, such as LLC slices and HBM channels, are replicated across the GPU.
For metrics related to them, rocprof can report either a separate value for a single component or a value aggregated over all of them.
For instance, one can obtain the read requests from LLC slice 0 to the off-chip NoC with TCC_EA_RDREQ0 (MI100 has 32 LLC slices in total, so rocprof lists this counter as TCC_EA_RDREQ[0-31]), or the aggregated value, summed over all slices, with TCC_EA_RDREQ_sum.
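The relation between per-slice counters and their aggregate can be sketched as follows: parse a range spec such as TCC_EA_RDREQ[0-31] into a base name and slice indices, then sum the per-slice readings. The helper names and the per-slice values are made up for illustration, and the snippet does not reproduce how rocprof computes the aggregate internally.

```python
import re


def parse_indexed_counter(spec):
    """Split a spec like 'TCC_EA_RDREQ[0-31]' into (base name, list of indices)."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", spec)
    if m is None:
        raise ValueError(f"not an indexed counter spec: {spec!r}")
    first, last = int(m.group(2)), int(m.group(3))
    return m.group(1), list(range(first, last + 1))


def aggregate(per_slice):
    """Sum the readings of all slices into one aggregated value."""
    return sum(per_slice.values())


base, indices = parse_indexed_counter("TCC_EA_RDREQ[0-31]")
# Made-up readings: slice i reports 100 + i read requests.
per_slice = {i: 100 + i for i in indices}
print(base, len(indices), aggregate(per_slice))  # TCC_EA_RDREQ 32 3696
```

Comparing the aggregate with individual slices is a quick way to spot imbalance, for example a single slice that receives far more requests than the others.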
Epilog
Additionally, AMD Research maintains Omnitrace, a research project that collects CPU and GPU profiles together for parallel applications.