What’s a Tensor Core (TC)? It is a specialized matrix-multiply unit integrated into general-purpose GPUs (GPGPUs), designed to accelerate the GEMM workloads that make up a large portion of machine learning applications. However, exploiting TCs effectively in CUDA comes with obstacles (a warp-level programming model, restricted tile shapes and precisions), so programmers rarely manage to use TCs to speed up their applications.
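To illustrate the warp-level programming model, here is a minimal sketch using NVIDIA's `nvcuda::wmma` API (the API names are real; the single 16×16×16 half-precision tile and the fixed leading dimensions are simplifying assumptions, and a real kernel would tile over a larger matrix):

```cuda
// Minimal WMMA sketch: one warp computes one 16x16 tile of D = A * B,
// with half-precision inputs and float accumulation on the Tensor Cores.
// Compile with a TC-capable arch, e.g.: nvcc -arch=sm_70 wmma_sketch.cu
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    // Fragments are opaque, warp-distributed register tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // all 32 threads of the warp
    wmma::load_matrix_sync(b_frag, b, 16);  // cooperate in each *_sync call
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on the TC
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
// Launch with (at least) one full warp, e.g. wmma_16x16x16<<<1, 32>>>(a, b, d);
```

The warp-collective calls and the rigid shape/precision constraints are exactly the kind of obstacle the papers below dissect and work around.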
Dissections & Microbenchmarks
- [TPDS ‘23] Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
- [IPDPS ‘20] Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
- Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking
TC with Intra-SM Parallelism
- [HPCA ‘22] Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
- [ISVLSI ‘22] Improving GPU Throughput through Parallel Execution Using Tensor Cores and CUDA Cores
- [ICCD ‘21] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks
GEMM / Scientific / DL Applications with TC
- [SC ‘22] Efficient Quantized Sparse Matrix Operations on Tensor Cores
- [ISCA ‘22] SIMD2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM
- [ISC ‘22] Toward Accelerated Stencil Computation by Adapting Tensor Core Unit on GPU
- [MICRO ‘20] Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores
- [ISC ‘19] Accelerating Reduction and Scan Using Tensor Core Units
GNN with TC
- [OSDI ‘23] TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs
- [PPoPP ‘22] QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core