Intro

There are two primary types of GPU resources to be partitioned: computing resources (SMs) and (off-chip) memory bandwidth. Computing resource partitioning can be further subdivided into inter-SM and intra-SM approaches. The former focuses on efficiently assigning GPU SMs to different concurrent tasks to achieve a specific target (SLA, system throughput, etc.), while the latter schedules CTAs/warps within an SM in a fine-grained manner so that memory stalls, occupancy, shared memory usage, Tensor Core/CUDA Core utilization, or other metrics can be improved.

Since the GPU is a throughput-oriented device, thousands of parallel threads can simultaneously issue memory requests to the attached off-chip HBM/GDDR memory. Off-chip bandwidth therefore becomes the most likely bottleneck for memory-intensive applications, which make up the majority of heterogeneous computing workloads. To cope with this problem, research from the compiler, architecture, and parallel-framework communities has focused on splitting the global memory bandwidth among concurrent tasks.

Papers

Works that require hardware modifications rely on GPU simulators to verify their effects, while others exploit vendor-provided runtime/driver/framework support to improve performance on off-the-shelf GPUs. There are also works that focus purely on dissecting and demystifying opaque hardware details.

Survey

  • [TPDS ‘22] A Survey of GPU Multitasking Methods Supported by Hardware Architecture

Targeting commodity hardware

  • [HPCA ‘23] KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers

CU scalability varies from kernel to kernel, so setting the granularity of resource partitioning for DL inference at the model level is too coarse. This work modifies ROCm to update the per-kernel CU affinity just before kernel launch, achieving fine-grained resource partitioning for concurrent inference tasks.

  • [OSDI ‘22] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences

  • [RTAS ‘19] Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs

Memory hierarchy dissection via micro-benchmarks, plus page coloring for memory isolation through an enhanced NVIDIA GPU driver. Compute isolation is achieved with persistent kernels (a.k.a. the SM-centric programming style).
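
A minimal CUDA sketch of the persistent-kernel / SM-centric idea (not the paper's code): the allowed_sm flag array, the tile-queue work distribution, and the element-wise placeholder work are illustrative assumptions.

    #include <cuda_runtime.h>

    // Each block reads the hardware SM id and retires immediately if it did not
    // land on an SM reserved for this task. Surviving blocks pull work items
    // from a global queue instead of relying on blockIdx, so no work is lost
    // when blocks retire.
    __device__ unsigned int smid() {
        unsigned int id;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
        return id;
    }

    __global__ void persistent_kernel(const bool* allowed_sm,  // one flag per SM
                                      int* next_tile,          // global queue head, zeroed by host
                                      const float* in, float* out, int n) {
        __shared__ bool keep;
        __shared__ int tile;
        if (threadIdx.x == 0) keep = allowed_sm[smid()];
        __syncthreads();
        if (!keep) return;                         // block landed on a foreign SM

        while (true) {
            if (threadIdx.x == 0) tile = atomicAdd(next_tile, 1);
            __syncthreads();
            int base = tile * blockDim.x;
            if (base >= n) break;                  // queue drained; all threads exit
            int i = base + threadIdx.x;
            if (i < n) out[i] = 2.0f * in[i];      // placeholder for the task's real work
            __syncthreads();                       // don't overwrite `tile` early
        }
    }

The grid is sized so that every block can be resident at once (roughly max blocks per SM times the number of SMs); blocks that land on SMs outside the reservation retire immediately, leaving those SMs free for the co-running task.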

  • [ISC ‘19] Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters

Laius exploits a decision tree trained offline to predict kernel execution time from the kernel's input and launch configuration. When a GPU serves batched queries, it models the partitioning decision as a knapsack problem, allocating “just enough” resources to the user-facing (QoS) application and assigning the leftover resources to the other, non-QoS batch applications.
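
A hypothetical host-side sketch of the “just enough” allocation step, assuming a predicted_ms(kernel_id, pct) callback that stands in for the offline-trained decision tree; the 10% step size and the even split of leftover resources are simplifications of the paper's knapsack formulation.

    #include <functional>
    #include <vector>

    // predicted_ms(kernel_id, pct) returns the predicted duration of a kernel
    // when it is given pct% of the GPU's compute resources.
    inline int allocate_shares(int qos_kernel, double qos_target_ms, int num_batch,
                               const std::function<double(int, int)>& predicted_ms,
                               std::vector<int>& batch_share /*out*/) {
        // Smallest share (in 10% steps) that still meets the QoS latency target.
        int qos_share = 100;
        for (int pct = 10; pct <= 100; pct += 10) {
            if (predicted_ms(qos_kernel, pct) <= qos_target_ms) { qos_share = pct; break; }
        }
        // Hand the leftover to the batch kernels. An even split is used here for
        // brevity; the paper formulates this step as a knapsack problem instead.
        int leftover = 100 - qos_share;
        batch_share.assign(num_batch, num_batch > 0 ? leftover / num_batch : 0);
        return qos_share;
    }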

Simulator-based

  • [ASPLOS ‘20] HSM: A Hybrid Slowdown Model for Multitasking GPUs

Combines a white-box metric (DRAM row-buffer hits) with a black-box model (linear regression) to classify the kernel type (compute- vs. memory-intensive) and predict its normalized progress in shared mode. The combined model guides scheduling policies for both QoS and fairness objectives. However, the row-buffer-hit metric is hard to obtain on commodity devices.
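
A schematic sketch of how such a hybrid model could be wired together; the counters, the 0.5 threshold, and the linear coefficients below are placeholders, not the paper's fitted values.

    // White-box step: row-buffer locality is the classification signal.
    // Black-box step: a per-class linear regression predicts normalized progress
    // (shared-mode performance relative to running alone).
    struct SharedModeCounters {
        double rbh;          // DRAM row-buffer hit rate observed in shared mode, 0..1
        double bw_share;     // fraction of DRAM bandwidth this kernel received, 0..1
        double ipc_norm;     // issue rate in shared mode, normalized to isolation, 0..1
    };

    enum class KernelType { Compute, Memory };

    inline KernelType classify(const SharedModeCounters& c) {
        return (c.rbh < 0.5) ? KernelType::Memory : KernelType::Compute;
    }

    inline double normalized_progress(const SharedModeCounters& c) {
        return (classify(c) == KernelType::Memory)
                   ? 0.05 + 0.95 * c.bw_share    // memory-bound: tracks bandwidth share
                   : 0.10 + 0.90 * c.ipc_norm;   // compute-bound: tracks issue rate
    }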

  • [DAC ‘20] Navigator: Dynamic Multi-kernel Scheduling to Improve GPU Performance

Integrates a cache-like lookup table containing performance metrics for kernels under solo and multi-kernel execution. At runtime, the scheduler iteratively selects the kernel pair with the highest predicted performance from the candidates. Re-profiling is required for kernel-pair selection if new kernels arrive before the next scheduling decision.
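
A hypothetical sketch of the lookup-table idea, assuming a profiled-throughput map keyed by kernel pair; the PairTable type and the convention of returning nullopt when an unprofiled pair is encountered are illustrative, not the paper's implementation.

    #include <algorithm>
    #include <map>
    #include <optional>
    #include <utility>
    #include <vector>

    using KernelId = int;
    using KernelPair = std::pair<KernelId, KernelId>;   // stored with the smaller id first

    struct PairTable {
        std::map<KernelPair, double> throughput;        // filled by profiling runs

        // Pick the best-known pair among the ready kernels; a nullopt result means
        // some candidate pair has not been profiled yet and needs a profiling pass.
        std::optional<KernelPair> best_pair(const std::vector<KernelId>& ready) const {
            std::optional<KernelPair> best;
            double best_perf = 0.0;
            for (size_t a = 0; a < ready.size(); ++a)
                for (size_t b = a + 1; b < ready.size(); ++b) {
                    KernelPair key = std::minmax(ready[a], ready[b]);
                    auto it = throughput.find(key);
                    if (it == throughput.end()) return std::nullopt;  // unknown pair
                    if (it->second > best_perf) { best_perf = it->second; best = key; }
                }
            return best;
        }
    };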

  • [ISCA ‘17] Quality of service support for fine-grained sharing on GPUs

  • [HPCA ‘16] Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing

Intra-SM resource under-utilization becomes the primary cause of inefficiency when multiple kernels co-run on a GPU. This paper proposes a partial preemption mechanism that preempts some of the running CTAs belonging to one kernel within an SM and cedes the freed resources to CTAs of the incoming kernel. A resource allocation scheme is also proposed to ensure fairness among co-running kernels.
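
As a rough illustration of fairness-oriented intra-SM allocation (a software stand-in, not the paper's hardware mechanism), the sketch below grows each kernel's CTA quota in a dominant-resource-fairness style until the SM's threads, registers, and shared memory are exhausted; all types and the greedy policy are assumptions.

    #include <algorithm>
    #include <array>
    #include <cstdint>

    struct CtaDemand  { uint32_t threads, regs, smem; };   // per-CTA requirements
    struct SmCapacity { uint32_t threads, regs, smem; };   // per-SM limits

    inline bool fits(const SmCapacity& cap, const std::array<CtaDemand, 2>& d,
                     const std::array<uint32_t, 2>& q) {
        return d[0].threads * q[0] + d[1].threads * q[1] <= cap.threads &&
               d[0].regs    * q[0] + d[1].regs    * q[1] <= cap.regs &&
               d[0].smem    * q[0] + d[1].smem    * q[1] <= cap.smem;
    }

    // Share of the SM consumed by q CTAs of a kernel, taken over its dominant resource.
    inline double dominant_share(const SmCapacity& cap, const CtaDemand& d, uint32_t q) {
        double s = double(d.threads) * q / cap.threads;
        s = std::max(s, double(d.regs) * q / cap.regs);
        return std::max(s, double(d.smem) * q / cap.smem);
    }

    // Grow the CTA quota of whichever kernel currently holds the smaller share of
    // its dominant resource until nothing more fits in the SM.
    inline std::array<uint32_t, 2> partition_sm(const SmCapacity& cap,
                                                const std::array<CtaDemand, 2>& d) {
        std::array<uint32_t, 2> quota{0, 0};
        while (true) {
            int first = dominant_share(cap, d[0], quota[0]) <=
                        dominant_share(cap, d[1], quota[1]) ? 0 : 1;
            bool grew = false;
            for (int k : {first, 1 - first}) {
                auto next = quota;
                ++next[k];
                if (fits(cap, d, next)) { quota = next; grew = true; break; }
            }
            if (!grew) break;                              // the SM is full
        }
        return quota;
    }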

Benchmarking & dissection

  • [RTNS ‘21] Exploring AMD GPU Scheduling Details by Experimenting With “Worst Practices”
  • [RTSS ‘17] GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed