HSA & Overview
AMD’s GPU software stack follows the HSA standard, where CPU, GPU and other devices are treated as HSA agents from the system’s view.
ROCt and ROCr
ROCt is a thin interface layer that exposes the APIs of the amdgpu kernel module to userspace.
ROCr is the layer above it, managing interactions between the language-level userspace runtime and the kernel driver/module.
ROCr
The core component in ROCr is core::Runtime, which maintains the connection to the kernel driver.
In AMD’s implementation there is a singleton instance of core::Runtime, so every interaction with the HSA runtime is handled by that one instance, which avoids inconsistency issues.
The single instance is constructed when core::Runtime::Acquire is called in the hsa_init function (called from roc::Device::init when a HIP API is first called); core::Runtime::Load is then invoked to initialize the HSA environment (runtime configurations, GPU agents, HSA extensions, API table, etc.).
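The acquire-then-load life cycle described above can be sketched in plain C++. This is a minimal sketch, not ROCr's code: MiniRuntime, its Acquire/Release methods and the counters are hypothetical stand-ins, and it assumes that repeated init calls share one instance and that only the first call triggers loading.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a process-wide runtime object (not ROCr's class).
class MiniRuntime {
 public:
  // The first call constructs and loads the instance; later calls only
  // increment the reference count and return the same pointer.
  static MiniRuntime* Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (instance_ == nullptr) {
      instance_ = new MiniRuntime();
      instance_->Load();
    }
    ++instance_->ref_count_;
    return instance_;
  }

  // The last release destroys the instance. Returns true if it was destroyed.
  static bool Release() {
    std::lock_guard<std::mutex> lock(mutex_);
    assert(instance_ != nullptr && "Release without matching Acquire");
    if (--instance_->ref_count_ > 0) return false;
    delete instance_;
    instance_ = nullptr;
    return true;
  }

  int load_count() const { return load_count_; }

 private:
  MiniRuntime() = default;
  // Placeholder for agent discovery, extension loading, API table setup, etc.
  void Load() { ++load_count_; }

  static MiniRuntime* instance_;
  static std::mutex mutex_;
  int ref_count_ = 0;
  int load_count_ = 0;
};

MiniRuntime* MiniRuntime::instance_ = nullptr;
std::mutex MiniRuntime::mutex_;
```

Calling Acquire twice returns the same pointer and runs Load only once, which is the property the singleton is meant to guarantee.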
HSA signal
hsa_signal_t itself contains a uint64_t as the underlying handle.
To work with a signal, ROCr provides a wrapper class hsa_signal_handle: it overloads operator-> to convert the raw handle into the rocr::core::Signal type.
core::Signal is a reference-counted object, which stays alive as long as some object refers to it.
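The opaque-handle pattern described here can be illustrated with a small self-contained sketch. All names (mini_signal_t, MiniSignal, MiniSignalHandle) are hypothetical, and the sketch assumes the handle simply stores the object's address; ROCr's actual conversion may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opaque handle handed to the user: just a 64-bit value.
struct mini_signal_t {
  std::uint64_t handle;
};

// Hypothetical runtime-side object behind the handle.
class MiniSignal {
 public:
  explicit MiniSignal(std::int64_t initial) : value_(initial) {}
  std::int64_t Load() const { return value_; }
  void Store(std::int64_t v) { value_ = v; }

  // Pack the object's address into the opaque handle (assumption of this sketch).
  static mini_signal_t ToHandle(MiniSignal* s) {
    return mini_signal_t{
        static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(s))};
  }

 private:
  std::int64_t value_;
};

// Wrapper that converts the raw handle back into the object via operator->.
class MiniSignalHandle {
 public:
  explicit MiniSignalHandle(mini_signal_t h) : handle_(h) {}
  MiniSignal* operator->() const {
    assert(handle_.handle != 0 && "null signal handle");
    return reinterpret_cast<MiniSignal*>(
        static_cast<std::uintptr_t>(handle_.handle));
  }

 private:
  mini_signal_t handle_;
};
```

With this wrapper, code that only holds the raw 64-bit handle can write MiniSignalHandle(h)->Load() without spelling out the cast at every call site.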
HSA agent
There are two primary agent types, CPU agent and GPU agent, both derived from Agent.
The HSA runtime is responsible for maintaining the lists of CPU and GPU devices.
Each CPU socket is abstracted as a CPU agent, and each GPU card is likewise represented as a GPU agent.
HSA signal callback registration
ROCr offers a registration API, hsa_amd_signal_async_handler, that lets the user register a handler on a signal together with a condition.
When the condition is met, the handler is executed asynchronously to process signal-related transactions.
Based on the handler's return value (bool), the HSA runtime decides whether to call the handler again the next time the condition is met (true) or to call it only once (false).
The ROCr API internally invokes Runtime::SetAsyncSignalHandler to register the handler on the specified signal.
The HSA runtime maintains vectors that track the registered signals, conditions and values.
In the Runtime::SetAsyncSignalHandler function, the HSA runtime spawns a thread (a pthread-based software thread) that asynchronously and continuously checks the signal value (AsyncEventsLoop is the entry function).
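The registration and polling behaviour can be sketched without any HSA dependency. The sketch below is an assumption-laden miniature: AsyncEvents, Cond and Entry are hypothetical names, the signal is a plain integer, and only one iteration of the polling loop is shown (PollOnce); a runtime would repeat that iteration in a background thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical comparison kinds for a registered condition.
enum class Cond { kEq, kNe, kLt, kGte };

// One registration: watched value, condition, compare value and handler.
struct Entry {
  const std::int64_t* signal;
  Cond cond;
  std::int64_t compare;
  std::function<bool(std::int64_t)> handler;  // true = keep me registered
};

class AsyncEvents {
 public:
  void Register(const std::int64_t* signal, Cond cond, std::int64_t compare,
                std::function<bool(std::int64_t)> handler) {
    assert(signal != nullptr);
    entries_.push_back(Entry{signal, cond, compare, std::move(handler)});
  }

  // One iteration of the polling loop. Returns how many handlers fired.
  int PollOnce() {
    int fired = 0;
    for (std::size_t i = 0; i < entries_.size();) {
      const std::int64_t v = *entries_[i].signal;
      if (!Met(entries_[i].cond, v, entries_[i].compare)) {
        ++i;
        continue;
      }
      ++fired;
      if (entries_[i].handler(v)) {
        ++i;  // handler asked to be called again
      } else {
        // One-shot handler: drop the registration.
        entries_.erase(entries_.begin() + static_cast<std::ptrdiff_t>(i));
      }
    }
    return fired;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  static bool Met(Cond c, std::int64_t v, std::int64_t cmp) {
    switch (c) {
      case Cond::kEq: return v == cmp;
      case Cond::kNe: return v != cmp;
      case Cond::kLt: return v < cmp;
      case Cond::kGte: return v >= cmp;
    }
    return false;
  }

  std::vector<Entry> entries_;
};
```

A handler that returns false is removed after its first invocation, while one that returns true stays in the list and fires again whenever the condition holds on a later poll.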
How to declare a new ROCr (HSA) API?
Put your implementation in core/runtime/hsa_ext_amd.cpp, and declare it in both core/inc/hsa_ext_amd_impl.h and inc/hsa_ext_amd.h.
HSA requires its APIs to be traceable, so the newly added API should be inserted into the AMD extension API table, with the _fn suffix, in inc/hsa_api_trace.h and core/common/hsa_table_interface.cpp, and bound in core/runtime/hsa_api_trace.cpp.
Finally, do not forget to expose your API via the hsacore.so.def script.
Loading ROC code object
The virtual device data structure is created and bound to a high-level HIP stream.
ROCclr will construct the corresponding device::Program and then roc::LightningProgram (its ROCr-backend counterpart).
The HSA API calling sequence is:
hsa_executable_create_alt
hsa_code_object_reader_create_from_memory
hsa_executable_load_agent_code_object
hsa_executable_freeze
Block Allocator in ROCr
To avoid frequent memory allocation commands between host and device, the ROCr runtime itself maintains a userspace heap-like data structure to minimize the allocation requests sent to the driver and hardware.
ROCclr
ROC common language runtime (ROCclr) serves as the intermediate layer between a specific programming language (OpenCL, HIP, etc.) and the (AMD extended) HSA userspace APIs.
The BlockAllocator is implemented in rocr::AMD::MemoryRegion::BlockAllocator:
The default block size is 2 MB, and it provides two common interfaces: alloc and free.
When users call the hipMalloc routine, the call eventually reaches the rocr::AMD::MemoryRegion::AllocateImpl function, where the block allocator is invoked to:
- align the requested size up to a multiple of the block size
- check whether any block has already been allocated but is unused; if so, return it to the user without actually allocating memory on the device
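The align-up step can be written as a one-line helper. This is a minimal sketch: AlignUp and kBlockSize are hypothetical names, and the 2 MB value is taken from the text above rather than from the ROCr source.

```cpp
#include <cassert>
#include <cstddef>

// Block size of 2 MB, as stated in the text (assumed, not read from ROCr).
constexpr std::size_t kBlockSize = 2 * 1024 * 1024;

// Round `size` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two for the bit trick to be valid.
inline std::size_t AlignUp(std::size_t size, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size + alignment - 1) & ~(alignment - 1);
}
```

Under this rounding, a request of a single byte occupies one full block, and a request one byte larger than a block occupies two.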
The block allocator is wrapped by a SimpleHeap, where memory allocation and free are managed by a heap-like structure with two levels of abstraction: block and fragment.
When a memory allocation request comes from the user, ROCr first checks whether the heap holds a “suitable” fragment whose size is equal to or slightly larger than the requested size (using lower_bound of std::multimap).
If so, that fragment is selected for the allocation, and the leftover space in the same block (free_size - request_size) is re-inserted into the heap.
Otherwise, ROCr tries to claim a new free block (either from the userspace block cache or from the driver).
When the user calls hipFree in their HIP program, ROCr employs a caching layer to keep the currently unused block in the host library (rather than discarding it and releasing the block on the device immediately).
ROCr first checks whether the fragment being released has free neighboring fragments, and tries to merge them into one larger unused fragment.
The caching layer is a double-ended queue (std::deque).
If all fragments of a block are free, the entire block is reported as unused and appended to the caching queue.
Otherwise, only the merged free fragment is inserted into the free-fragment heap for later use.
With this recycling mechanism, the heap-based allocator can claim a block from the userspace queue instead of from the driver.
To actually release all cached blocks held in userspace, one has to call the trim interface, which requests the underlying block allocator to return all unused blocks to the driver (and hardware).
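The block/fragment scheme above can be sketched as a small host-only model. Everything here is hypothetical (MiniHeap and its members are not ROCr's SimpleHeap), addresses are fake integers, and the "driver" is only a counter. The sketch assumes blocks are block-size aligned and that fragments are merged only within one block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>

// Hypothetical two-level heap: fixed-size blocks carved into fragments.
// Addresses are fake integers; no device memory is touched.
class MiniHeap {
 public:
  explicit MiniHeap(std::size_t block_size)
      : block_size_(block_size), next_base_(block_size) {}

  // Returns the base address of a fragment holding at least `size` bytes.
  std::uintptr_t Alloc(std::size_t size) {
    assert(size > 0 && size <= block_size_);
    auto it = free_by_size_.lower_bound(size);  // smallest fragment that fits
    if (it == free_by_size_.end()) {
      AddFree(ClaimBlock(), block_size_);  // no fit: bring in a whole block
      it = free_by_size_.lower_bound(size);
    }
    const std::size_t free_size = it->first;
    const std::uintptr_t base = it->second;
    free_by_size_.erase(it);
    frags_[base] = Frag{size, false};
    if (free_size > size) AddFree(base + size, free_size - size);  // leftover
    return base;
  }

  void Free(std::uintptr_t base) {
    auto it = frags_.find(base);
    assert(it != frags_.end() && !it->second.free);
    std::size_t size = it->second.size;
    const std::uintptr_t block = base - base % block_size_;
    auto next = std::next(it);  // free right neighbour in the same block?
    if (next != frags_.end() && next->second.free &&
        next->first < block + block_size_) {
      size += next->second.size;
      RemoveFree(next->first, next->second.size);
    }
    if (it != frags_.begin()) {  // free left neighbour in the same block?
      auto prev = std::prev(it);
      if (prev->second.free && prev->first >= block) {
        base = prev->first;
        size += prev->second.size;
        RemoveFree(prev->first, prev->second.size);
      }
    }
    frags_.erase(it);
    if (size == block_size_) cache_.push_back(block);  // whole block is free
    else AddFree(base, size);  // otherwise keep the merged fragment
  }

  // Hands every cached block back to the simulated driver.
  std::size_t Trim() {
    const std::size_t n = cache_.size();
    driver_blocks_ -= n;
    cache_.clear();
    return n;
  }
  std::size_t driver_blocks() const { return driver_blocks_; }
  std::size_t cached_blocks() const { return cache_.size(); }

 private:
  struct Frag { std::size_t size; bool free; };

  std::uintptr_t ClaimBlock() {
    if (!cache_.empty()) {  // prefer a cached block over a driver request
      const std::uintptr_t b = cache_.front();
      cache_.pop_front();
      return b;
    }
    ++driver_blocks_;
    const std::uintptr_t b = next_base_;
    next_base_ += block_size_;
    return b;
  }
  void AddFree(std::uintptr_t addr, std::size_t size) {
    frags_[addr] = Frag{size, true};
    free_by_size_.emplace(size, addr);
  }
  void RemoveFree(std::uintptr_t addr, std::size_t size) {
    auto range = free_by_size_.equal_range(size);
    for (auto i = range.first; i != range.second; ++i) {
      if (i->second == addr) { free_by_size_.erase(i); break; }
    }
    frags_.erase(addr);
  }

  std::size_t block_size_;
  std::uintptr_t next_base_;
  std::size_t driver_blocks_ = 0;
  std::multimap<std::size_t, std::uintptr_t> free_by_size_;  // size -> base
  std::map<std::uintptr_t, Frag> frags_;                     // base -> fragment
  std::deque<std::uintptr_t> cache_;                         // unused blocks
};
```

In this model, freeing the last live fragment of a block moves the block into the cache, and a later allocation reuses it without increasing driver_blocks(); only Trim lowers that count.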
Physical device vs. virtual device
Derivation relationships:
- physical device: amd::Device ➡️ roc::NullDevice ➡️ roc::Device
- virtual device: device::VirtualDevice ➡️ device::VirtualGPU
Each HSA agent corresponds to a physical device (pdevice), and each HIP stream corresponds to a virtual device (vdevice).
ROCm amd::HostQueue vs. amd::DeviceQueue, and amd::Context
amd::HostQueue is a linked-list data structure derived from amd::CommandQueue.
It contains a thread together with a virtual device, and is responsible for managing host commands.
amd::DeviceQueue is also derived from amd::CommandQueue, and is embedded in the Context object.
Kernel launch procedure
hipLaunchKernel receives the kernel launch request from the host, then performs a series of parameter checks and validations to ensure that the kernel parameters (work group size, LDS usage, etc.) meet the hardware requirements.
The procedure then continues in ihipModuleLaunchKernel, where the kernel launch command is generated and pushed to the host queue.
Stream creation & CU masking adjustment procedure
We mainly focus on one question: how is the CU mask configuration applied to the stream/queue?
Calling sequence of hipExtStreamCreateWithCUMask:
hipExtStreamCreateWithCUMask ➡️ ihipStreamCreate ➡️ hip::Stream::Create ➡️ amd::HostQueue ➡️ hip::Device::SaveQueue
HIP stream synchronize API
hipEvent_t is a wrapper of hip::Event. When AMD_DIRECT_DISPATCH is set, ROCclr uses hip::EventDD, derived from hip::Event, as the marker to record activities.
When hipEventCreate is called, ROCclr creates the corresponding HIP event, binds it to the passed-in hipEvent_t, and inserts it into the global event set.
For hipEventRecord, the runtime calls Event::addMarker to insert an HSA marker packet (wrapped by hip::EventMarker) into the HSA queue.
In this procedure, HIP first calls Event::recordCommand to create a marker command, then invokes Event::enqueueRecordCommand to insert the newly created command into the corresponding queue (determined by the stream).
Declare and implement a new HIP runtime API
First, declare your API (hipStreamXXX in this example) in HIP/include/hip/hip_runtime_api.h, and provide the corresponding implementation in hipamd/src/hip_stream.cpp.
Then insert a new HIP_API_ID_hipStreamXXX at the end of enum hip_api_id_t in hipamd/include/hip/amd_detail/hip_prof_str.h, and remember to update the value of HIP_API_ID_LAST.
After hip_prof_str.h is updated, run the following command (or rerun cmake --build build) to generate a new version of hip_prof_str.h containing both the newly added HIP_API_ID and the callback ID for this self-defined HIP runtime API:
Once the previous steps are done, you can successfully build the hipamd project and generate your own libamdhip64.so.
However, HIP programs will still fail to link against it; the reported errors say:
undefined reference to `hipStreamXXX'
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)
Now if you inspect the exposed symbols in libamdhip64.so (via the nm -D command), you will find that hipStreamXXX does not appear.
This is caused by Linux ELF shared-object symbol visibility, which in this project is controlled by the hipamd/src/CMakeLists.txt CMake script so as to hide all non-authorized symbols.
To make the newly added API a visible symbol, you need to append it to hipamd/src/hip_hcc.map.in:
Then you can find hipStreamXXX@hip_5.3 among the exposed symbols in libamdhip64.so.
Tracing and Logging
Environment variables
The full list of supported environment variables and their effects can be found on the ROCm documentation site: https://rocm.docs.amd.com/projects/HIP/en/latest/how_to_guides/debugging.html#summary-of-environment-variables-in-hip.
The environment variable AMD_LOG_LEVEL (defined by macro in ROCclr:utils/flags.cpp) indicates which kind of information is printed at runtime.
Currently there are 5 supported logging levels:
LOG_NONE
LOG_ERROR
LOG_WARNING
LOG_INFO
LOG_DEBUG
These flags are passed together with the logging format and messages to the ClPrint function for printing.
There is another knob, AMD_LOG_MASK, to control which kind of logging information to print. For example, there are various places where logging functions are called:
- LOG_API: in the HIP runtime API
- LOG_INIT: when the HIP/ROCm software stack is initializing
- LOG_AQL: decode and display AQL packet content
The default value for AMD_LOG_MASK is LOG_ALWAYS, which displays all logging information everywhere.
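How a level and a mask combine to decide whether a message is printed can be sketched as follows. This is an illustrative model only: the enumerator names, the numeric level values (assumed to follow the order listed above), the mask bit positions and the default when the variables are unset are all assumptions, not values read from ROCclr.

```cpp
#include <cassert>
#include <cstdlib>

// Levels in the order listed above; the numeric values are an assumption.
enum LogLevel { kNone = 0, kError = 1, kWarning = 2, kInfo = 3, kDebug = 4 };

// Illustrative mask bits only; the real bit assignments may differ.
enum LogMask : unsigned {
  kApi = 1u << 0,
  kInit = 1u << 1,
  kAql = 1u << 2,
  kAlways = ~0u
};

struct LogConfig {
  int level;      // highest level that is still printed
  unsigned mask;  // categories that are enabled
};

// Reads the two environment variables, falling back to "no logging, all
// categories" when they are unset (the fallback is an assumption).
inline LogConfig ReadLogConfig() {
  LogConfig cfg{kNone, kAlways};
  if (const char* s = std::getenv("AMD_LOG_LEVEL")) {
    cfg.level = static_cast<int>(std::strtol(s, nullptr, 0));
  }
  if (const char* s = std::getenv("AMD_LOG_MASK")) {
    cfg.mask = static_cast<unsigned>(std::strtoul(s, nullptr, 0));
  }
  return cfg;
}

// A message is printed only if its level is enabled and its category bit
// is present in the configured mask.
inline bool ShouldLog(const LogConfig& cfg, LogLevel level, unsigned mask) {
  assert(level != kNone);
  return cfg.level >= level && (cfg.mask & mask) != 0;
}
```

Under these assumptions, a configuration of level kInfo with mask kApi prints API messages at error, warning and info level, but drops debug messages and anything tagged only with kAql.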
To serialize kernel launch and execution, there is also an environment variable AMD_SERIALIZE_KERNEL, which accepts the following values:
- 1: wait for completion before enqueue
- 2: wait for completion after enqueue
- 3: both of the above
Miscs.
There may be cases where the driver gets stuck. Rebooting fixes this but is a little troublesome; to reset the AMD driver instead, read the following file:
Terms
- SVM: OpenCL shared virtual memory
- Virtual Device: an abstraction of physical device in ROCclr, the backend of a HIP stream
- Command Queue: a linked-list-like data structure under hipStream, responsible for managing commands (host operations)