Fused Tensor Core kernels for Ozaki Schemes I and II achieve up to 83% of INT8 peak throughput and outperform cuBLAS TF32 and ZGEMM on large matrices at comparable accuracy.
Guaranteed DGEMM accuracy while using reduced precision tensor cores through extensions of the Ozaki scheme
5 Pith papers cite this work.
citation summary
years: 2026 (5)
roles: background (1)
polarities: background (1)
representative citing papers
BF16 tensor cores on GPUs emulate FP32 SGEMM with superior performance, power efficiency, and numerical accuracy compared to native FP32, including a library implementation that handles denormals.
An adaptation of the Ozaki-II scheme allows DGEMM emulation on FP8 MMA units with significantly reduced computational cost compared to FP8-based Ozaki-I.
The quatrex quantum transport solver achieves up to 51% higher throughput using low-precision formats while maintaining accuracy on realistic semiconductor systems.
No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures offer medium-term operator acceleration, and quantum computing offers long-term potential for warm dense matter microphysics.
citing papers explorer
- Exceeding the Numerical and Performance Characteristics of IEEE-754 SGEMM with BFloat16 Tensor Cores on GPUs for Scientific Computing