What’s a Tensor Core (TC)? It’s an ASIC integrated into general-purpose GPUs (GPGPUs), designed to accelerate the GEMM workloads that make up a large portion of machine learning applications. However, because there are obstacles to exploiting TCs effectively in CUDA, programmers can rarely make full use of them to speed up their applications.
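To make the programming obstacle concrete, the standard way to reach Tensor Cores from CUDA C++ is the warp-level WMMA API (`nvcuda::wmma` in `<mma.h>`). The sketch below shows a minimal kernel in which one warp computes a single 16×16 tile of C = A·B; it is a simplified illustration (fixed 16×16×16 tile, leading dimension 16, one warp), not a tuned GEMM. Note the constraints that make TCs awkward to use: every WMMA call is a warp-collective, tile shapes and data types are restricted, and the layout of a `fragment` across the warp's registers is opaque.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile of C = A * B.
// A and B are half precision; accumulation is in float
// (the common mixed-precision Tensor Core mode).
// Launch with at least one full warp, e.g. <<<1, 32>>>.
__global__ void wmma_tile_16x16(const half *A, const half *B, float *C) {
    // Fragments are distributed across the 32 registers of the warp;
    // their internal layout is opaque to the programmer.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // zero the accumulator tile

    // All of these are warp-collective operations: every lane must
    // participate, or behavior is undefined.
    wmma::load_matrix_sync(a_frag, A, 16);  // 16 = leading dimension
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on TCs

    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

A full GEMM would tile the K dimension in a loop of `load_matrix_sync`/`mma_sync` calls and assign many warps to different output tiles; several of the papers below study exactly how these fragment layouts and instruction latencies behave under the hood. (Requires a GPU with compute capability 7.0 or higher; not runnable without Tensor Core hardware.)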

Dissections & Microbenchmarks

  • [TPDS ‘23] Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
  • [IPDPS ‘20] Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
  • Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking

TC with Intra-SM Parallelism

  • [HPCA ‘22] Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
  • [ISVLSI ‘22] Improving GPU Throughput through Parallel Execution Using Tensor Cores and CUDA Cores
  • [ICCD ‘21] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks

GEMM / Scientific / DL Applications with TC

  • [SC ‘22] Efficient quantized sparse matrix operations on tensor cores
  • [ISCA ‘22] SIMD2: a generalized matrix instruction set for accelerating tensor computation beyond GEMM
  • [ICS ‘22] Toward accelerated stencil computation by adapting tensor core unit on GPU
  • [MICRO ‘20] Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores
  • [ICS ‘19] Accelerating reduction and scan using tensor core units

GNN with TC

  • [ATC ‘23] TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs
  • [PPoPP ‘22] QGTC: accelerating quantized graph neural networks via GPU tensor core