Papers: Tensor Core
What’s a Tensor Core (TC)? It is a specialized ASIC unit integrated into general-purpose GPUs (GPGPUs), designed to accelerate the GEMM workloads that make up a large portion of machine-learning applications. However, because there are obstacles to exploiting TCs effectively in CUDA, programmers rarely manage to use them to speed up their applications.
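For context on what "exploiting TC in CUDA" looks like, below is a minimal sketch using NVIDIA's warp-level WMMA API (`mma.h`), in which one warp cooperatively computes a single 16×16×16 half-precision GEMM tile on the Tensor Cores. The tile shape, layouts, and kernel name here are illustrative choices, not taken from any of the listed papers; a real GEMM would tile over much larger matrices and stage data through shared memory.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 tile on Tensor Cores.
// a, b are half-precision inputs; c is a float accumulator/output tile.
__global__ void wmma_gemm_tile(const half *a, const half *b, float *c) {
    // Fragments are opaque, register-resident tiles distributed across the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);            // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The warp-synchronous programming model (fixed fragment shapes, opaque register layouts, whole-warp cooperation) is precisely the kind of obstacle that the microbenchmarking and kernel-fusion papers below set out to dissect or work around.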
# Dissections & Microbenchmarks
- [TPDS'23] Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
- [IPDPS'20] Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
- [arXiv'19] Dissecting the NVidia Turing T4 GPU via Microbenchmarking
# TC with Intra-SM Parallelism
- [HPCA'22] Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
- [ISVLSI'22] Improving GPU Throughput through Parallel Execution Using Tensor Cores and CUDA Cores
- [ICCD'21] Exploiting Intra-SM Parallelism in GPUs via Persistent and Elastic Blocks
# GEMM / Scientific / DL App. with TC
- [SC'22] Efficient quantized sparse matrix operations on tensor cores
- [ISCA'22] SIMD2: a generalized matrix instruction set for accelerating tensor computation beyond GEMM
- [ICS'22] Toward accelerated stencil computation by adapting tensor core unit on GPU
- [MICRO'20] Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores
- [ICS'19] Accelerating reduction and scan using tensor core units
# GNN with TC
- [ATC'23] TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs
- [PPoPP'22] QGTC: accelerating quantized graph neural networks via GPU tensor core