Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores
1 Pith paper cites this work.
Fields: cs.DC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication
Fused Tensor Core kernels for Ozaki Schemes I and II achieve up to 83% of INT8 peak throughput and outperform cuBLAS TF32 and ZGEMM on large matrices at comparable accuracy.