Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
2 Pith papers cite this work.
abstract
In this paper, we propose a method for emulating double-precision general matrix-matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many high-performance computing applications. Ozaki-I and Ozaki-II are established DGEMM emulation schemes that run on low-precision matrix multiply-accumulate (MMA) units. For the Ozaki-I scheme, INT8-, FP8-, and FP16-based implementations have been proposed, all of which share the same underlying algorithmic structure. In contrast, although INT8-based implementations of the Ozaki-II scheme have been reported, the original algorithm cannot be directly adapted to exploit FP8 MMA units. In several recent architectures, such as NVIDIA Blackwell Ultra and NVIDIA Rubin, INT8 performance has been reduced, so relying on INT8 alone is insufficient. Therefore, we introduce a novel technique that enables DGEMM emulation based on the Ozaki-II scheme on FP8 MMA units. Compared to the FP8-based Ozaki-I scheme, our method significantly reduces the computational cost and enables efficient FP64 emulation.
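The Ozaki-II scheme is built on modular arithmetic: an integer matrix product is computed modulo several pairwise-coprime moduli and then recombined with the Chinese remainder theorem (CRT). The following is a minimal sketch of that idea only, not the paper's algorithm. It assumes the inputs have already been scaled to integers, uses exact Python integers as a stand-in for the low-precision MMA units, and picks the moduli in `MODULI` purely for illustration. The names `crt_matmul` and `symmetric_residue` are hypothetical.

```python
from math import prod

# Illustrative pairwise-coprime moduli (an assumption, not taken from the paper).
MODULI = (251, 253, 255, 256)


def matmul(a, b):
    """Plain exact integer matrix product, used as reference and as MMA stand-in."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def symmetric_residue(x, m):
    """Map x to its representative modulo m in the range (-m/2, m/2]."""
    r = x % m
    return r - m if r > m // 2 else r


def crt_matmul(a, b, moduli=MODULI):
    """Compute a @ b from small-residue products recombined with the CRT.

    The result is exact only when every entry of the true product has
    magnitude below prod(moduli) / 2 (the CRT uniqueness condition).
    """
    big_m = prod(moduli)
    rows, cols = len(a), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for m in moduli:
        # Reduce the inputs to small residues; a real implementation would
        # feed these to low-precision MMA hardware.
        a_m = [[symmetric_residue(x, m) for x in row] for row in a]
        b_m = [[symmetric_residue(x, m) for x in row] for row in b]
        c_m = matmul(a_m, b_m)
        # Standard CRT weight: congruent to 1 mod m and to 0 mod the other moduli.
        m_i = big_m // m
        weight = m_i * pow(m_i, -1, m)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += weight * (c_m[i][j] % m)
    # Fold back into the symmetric range so negative entries are recovered.
    return [[symmetric_residue(v % big_m, big_m) for v in row] for row in acc]


a = [[12345, -6789], [1000, 42]]
b = [[-321, 9999], [7777, -5]]
result = crt_matmul(a, b)
reference = matmul(a, b)
```

With these inputs every entry of the true product is far below half the product of the moduli, so `result` matches `reference`. If the entries were larger, the reconstruction would wrap around, which is why the choice of input scaling matters in this family of methods.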
fields: cs.MS (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
Ozaki-Bailey 3D FFT achieves near memory-roof FP64 performance on B300 by emulating via FP8 tensor cores with Garner reconstruction split into phases and Kulisch escape on INT32 units.
citing papers explorer
- Improved Scaling for Fast Mode of Ozaki Scheme II
A scale-invariant revision to the fast-mode scaling formula in Ozaki scheme II ensures the CRT uniqueness condition holds for all input scalings while preserving the speed of the original fast mode.
- FP8 is All You Need (Part 2): Efficient Ozaki-Bailey Style FFT Through Tensor-core Garner Reformulation and Kulisch Escape Route
Ozaki-Bailey 3D FFT achieves near memory-roof FP64 performance on B300 by emulating via FP8 tensor cores with Garner reconstruction split into phases and Kulisch escape on INT32 units.