FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail

Satoshi Matsuoka

arxiv: 2606.06510 · v1 · pith:EX2AKXJWnew · submitted 2026-05-28 · 💻 cs.AR · cs.AI· cs.DC· cs.PF

FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail

Satoshi Matsuoka This is my paper

Pith reviewed 2026-06-29 00:41 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.PF

keywords FP8FP64 emulationOzaki Scheme IIHPC kernelsGPU architectureBlackwelltensor coresChinese Remainder Theorem

0 comments

The pith

FP8 tensor throughput with Ozaki Scheme II recovers full FP64 accuracy at memory-roof speeds on B300 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that native hardware FP64 is required for high-performance scientific computing. It demonstrates that on B300-generation AI GPUs, where native FP64 falls to 1.3 TFLOPS, combining abundant FP8 tensor cores with the Chinese Remainder Theorem-based Ozaki Scheme II enables emulation that reaches the memory roof while preserving full double-precision accuracy for kernels such as SpMV, GEMV, and stencils. A Tensor-Memory Equilibrium model incorporating compute multiplier alpha, bandwidth multiplier beta, and reconstruction latency gamma shows that register-level fusion drives emulation overhead to negligible levels. Projections indicate effective FP64 performance of approximately 500 TFLOPS on B300 and 400 TFLOPS on Rubin R200, exceeding both native B300 capability and prior H100 native FP64 results across the studied workloads.

Core claim

On AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum, projecting effective FP64 rates of 500 TFLOPS on B300 while matching memory bandwidth limits in bandwidth-bound cases.

What carries the argument

Ozaki Scheme II, a Chinese Remainder Theorem-based reconstruction method that emulates FP64 using FP8 tensor operations, with register-level fusion to achieve beta approaching 1 in the TME model.

If this is right

B300 native FP64 at 1.3 TFLOPS is replaced by emulated rates up to 500 TFLOPS in compute-bound regimes.
Performance matches or exceeds H100 native FP64 on all workloads studied.
Combined with Kulisch fixed-point methods for FFTs, every kernel class reaches the memory roof at full FP64.
Native FP64 silicon is no longer required for production HPC workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future GPU designs may allocate more silicon to tensor units rather than dedicated FP64 pipelines.
The same emulation approach could be tested on other low-precision formats or non-GPU accelerators.
Real hardware runs would be required to validate whether the TME model parameters hold under production conditions.

Load-bearing premise

The TME model parameters guarantee that reconstruction overhead remains negligible and full FP64 accuracy holds for every surveyed kernel.

What would settle it

Measure actual throughput and accuracy of Ozaki II on B300 hardware for kernels such as GEMM and SpMV to check whether full FP64 results are obtained near the projected 500 TFLOPS or memory roof.

Figures

Figures reproduced from arXiv: 2606.06510 by Satoshi Matsuoka.

read the original abstract

Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum. NVIDIA's Blackwell Ultra (B300) collapses native FP64 to ~1.3 TFLOPS -- a 31x regression from the B200 -- rendering even memory-bound kernels (SpMV, GEMV, stencils) compute-bound. We make four contributions. First, a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. Second, we identify register-level fusion as the mechanism driving beta -> 1, making emulation essentially free behind the memory wall. Third, we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200), exceeding even B200's native FP64 ceiling by over an order of magnitude in the compute-bound regime while matching the memory roof in the bandwidth-bound regime. Fourth, against an H100 baseline, Ozaki II matches or exceeds H100 on every workload studied, versus the up-to-50x regression that B300 native FP64 imposes. Combined with a companion FFT analysis (Kulisch fixed-point reconstruction on the surviving INT32 pipe) and FP32+Kahan reductions reported in the companion Part(2) paper, every surveyed kernel class on B300 reaches the memory roof at full FP64. The evidence supports the title's claim: FP8, with Ozaki II and Kulisch escape routes, is all one needs for production HPC; native FP64 silicon is no longer the holy grail it has been taken to be.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper argues FP8 plus Ozaki II can deliver memory-roof FP64 on B300 without native hardware, but the TME projections rest on unverified multipliers.

read the letter

The main point is that this paper claims abundant FP8 tensor throughput on B300 GPUs, combined with Ozaki Scheme II, can recover full FP64 accuracy at memory-roof performance for kernels like SpMV, GEMV, and stencils, making dedicated FP64 silicon unnecessary.

What is new is the Tensor-Memory Equilibrium model that layers compute multiplier alpha, bandwidth multiplier beta, and reconstruction latency gamma onto a Roofline base, plus the concrete projections of roughly 500 TFLOPS effective FP64 on B300 and 400 on Rubin. The paper also ties the B300's 31x drop in native FP64 to the broader shift toward AI-optimized silicon.

It does a clear job showing why the hardware trend creates a real problem for traditional HPC and how prior Ozaki work might address it without new silicon.

The soft spots sit in the modeling assumptions. The claim that register-level fusion drives beta exactly to 1 lacks any register-pressure analysis, instruction schedule, or derivation showing the extra FP8 GEMMs and CRT steps fit without spilling or added traffic. The parameters alpha, beta, and gamma appear fitted rather than independently measured, so the 500 TFLOPS figure is generated by the model itself. No implementation details, error bars, or direct comparisons to native FP64 references appear in the abstract, leaving accuracy and overhead claims untested.

This is for hardware architects and HPC researchers weighing future GPU requirements. A reader tracking AI-HPC convergence would find the argument worth examining. It deserves peer review because the question is timely and the modeling framework is a usable starting point, even though any referee will want concrete validation of the beta=1 premise and end-to-end measurements.

Referee Report

3 major / 1 minor

Summary. The paper claims that on AI-optimized GPUs such as NVIDIA Blackwell Ultra (B300), abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II can recover memory-roof execution at full FP64 accuracy for canonical HPC kernels including SpMV, GEMV, and stencils. It introduces the Tensor-Memory Equilibrium (TME) analytic model augmenting Roofline with multipliers alpha (compute), beta (bandwidth), and gamma (reconstruction latency), asserts that register-level fusion drives beta to 1, and projects ~500 TFLOPS effective FP64 on B300 (exceeding B200 native FP64) while matching or exceeding H100 baselines, concluding that native FP64 silicon is no longer required.

Significance. If the TME projections and accuracy claims hold with supporting analysis, the result would challenge the necessity of dedicated FP64 hardware in future HPC systems and shift design priorities toward lower-precision tensor units with emulation techniques, with potential impact on both architecture and software stacks for scientific computing.

major comments (3)

[Abstract] Abstract: the TME model defines free parameters alpha, beta, and gamma (with beta asserted to reach exactly 1 via register-level fusion) but supplies no register-pressure analysis, instruction schedule, or first-principles derivation showing that multiple FP8 GEMMs plus CRT reconstruction fit without spilling or extra memory traffic for the surveyed kernels; the ~500 TFLOPS projection on B300 is obtained by direct multiplication of the native FP8 rate by these unverified factors.
[Abstract] Abstract: the claim of 'full FP64 accuracy' for Ozaki Scheme II across the kernel spectrum is stated without cross-validation against native FP64 references or explicit bounds on rounding/overflow in the reconstruction steps, yet this premise is required for the central assertion that emulation recovers exact memory-roof execution.
[Abstract] Abstract: the performance comparison to H100 and the claim that Ozaki II 'matches or exceeds H100 on every workload' while native B300 FP64 imposes up to 50x regression rest on the same unverified TME parameters; no sensitivity analysis or alternative parameter values are provided to test robustness.

minor comments (1)

[Abstract] The abstract introduces the TME model and Ozaki Scheme II without a forward reference to the section containing their formal definitions or the precise equations for alpha, beta, and gamma.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the presentation of the TME model and the supporting claims. We address each major comment below and will incorporate additional analysis to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the TME model defines free parameters alpha, beta, and gamma (with beta asserted to reach exactly 1 via register-level fusion) but supplies no register-pressure analysis, instruction schedule, or first-principles derivation showing that multiple FP8 GEMMs plus CRT reconstruction fit without spilling or extra memory traffic for the surveyed kernels; the ~500 TFLOPS projection on B300 is obtained by direct multiplication of the native FP8 rate by these unverified factors.

Authors: We agree that the abstract does not contain the requested register-pressure analysis or instruction schedule. The TME model in the manuscript body parameterizes beta from the architectural register file size and the number of intermediate values generated by Ozaki II, but a detailed first-principles derivation confirming no register spilling for SpMV, GEMV, and stencils is absent. We will add this analysis (including register usage estimates and a high-level schedule) in a new subsection or appendix of the revised manuscript to verify that beta reaches 1 without extra memory traffic. revision: yes
Referee: [Abstract] Abstract: the claim of 'full FP64 accuracy' for Ozaki Scheme II across the kernel spectrum is stated without cross-validation against native FP64 references or explicit bounds on rounding/overflow in the reconstruction steps, yet this premise is required for the central assertion that emulation recovers exact memory-roof execution.

Authors: The manuscript asserts full FP64 accuracy on the basis of the exact arithmetic properties of CRT reconstruction within the representable range. However, the abstract and main text do not include explicit cross-validation against native FP64 or derived bounds on reconstruction overflow for the target kernels. We will expand the accuracy section with direct numerical comparisons and explicit overflow bounds in the revision. revision: yes
Referee: [Abstract] Abstract: the performance comparison to H100 and the claim that Ozaki II 'matches or exceeds H100 on every workload' while native B300 FP64 imposes up to 50x regression rest on the same unverified TME parameters; no sensitivity analysis or alternative parameter values are provided to test robustness.

Authors: The H100 comparisons and the 50x regression claim are obtained by substituting the B300 and H100 architectural parameters into the TME equations. We acknowledge the absence of sensitivity analysis on alpha, beta, and gamma. The revised manuscript will include a sensitivity study that varies these parameters over plausible ranges and reports the resulting performance envelopes, thereby testing robustness of the workload comparisons. revision: yes

Circularity Check

1 steps flagged

TME model projections of ~500 TFLOPS reduce to assumed multipliers alpha/beta/gamma by construction

specific steps

fitted input called prediction [Abstract]
"a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. ... we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200)"

The ~500 TFLOPS figure is obtained by plugging the TME parameters (particularly beta set to 1 via the register-fusion claim) into the model; the projection is therefore the direct numerical output of those assumed multipliers rather than an independent result.

full rationale

The abstract presents the TME model (with alpha, beta, gamma) as the basis for projecting emulated FP64 performance to ~500 TFLOPS on B300, asserting beta->1 via register-level fusion. This matches the fitted_input_called_prediction pattern because the headline performance numbers are generated directly from the model's own parameters rather than from independent derivation, external benchmarks, or first-principles register analysis shown in the text. No other circular patterns (self-citation chains, self-definitional equations, or ansatz smuggling) are exhibited in the provided sections. The central claim therefore has partial circularity but retains some independent modeling content.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entities

The central claim depends on the TME model parameters and the assumption that Ozaki II reconstruction preserves full accuracy while incurring negligible latency; these are introduced without external validation in the abstract.

free parameters (3)

alpha
compute multiplier augmenting Roofline in TME model
beta
bandwidth multiplier in TME model
gamma
reconstruction latency term in TME model

axioms (2)

standard math Chinese Remainder Theorem enables exact FP64 reconstruction from FP8 operations
Invoked to claim full accuracy in Ozaki Scheme II
domain assumption Register-level fusion drives beta to 1 making emulation free behind the memory wall
Stated as the mechanism that eliminates overhead in the abstract

invented entities (1)

Tensor-Memory Equilibrium (TME) model no independent evidence
purpose: Augmented Roofline model for emulated FP64 performance
New analytic framework introduced to support the projections

pith-pipeline@v0.9.1-grok · 5927 in / 1704 out tokens · 35120 ms · 2026-06-29T00:41:43.713508+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 3 canonical work pages

[1]

Amestoy, Iain S

Patrick R. Amestoy, Iain S. Duff, Jean-Yves L’Excellent, and Jacko Koster. A fully asyn- chronousmultifrontalsolverusingdistributeddynamicscheduling.SIAM Journal on Matrix 24 Analysis and Applications, 23(1):15–41, 2001

2001
[2]

Higham, and Enrique S

Hartwig Anzt, Jack Dongarra, Goran Flegar, Nicholas J. Higham, and Enrique S. Quintana- Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers.Concurrency and Computation: Practice and Experience, 31(6):e4460, 2019

2019
[3]

Booth, Alex J.˜W

George H. Booth, Alex J.˜W. Thom, and Ali Alavi. Fermion Monte Carlo without fixed nodes: A game of life, death, and annihilation in Slater determinant space.Journal of Chemical Physics, 131:054106, 2009

2009
[4]

Clark, Ronald Babich, Kipton Barros, Richard C

Michael A. Clark, Ronald Babich, Kipton Barros, Richard C. Brower, and Claudio Rebbi. Solving lattice QCD systems of equations using mixed precision solvers on GPUs.Computer Physics Communications, 181(9):1517–1528, 2010

2010
[5]

Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern

Nicolas Dickenmann. Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern. Blog post, 2026

2026
[6]

Harvey L. Garner. The residue number system.IRE Transactions on Electronic Computers, EC-8(2):140–147, 1959

1959
[7]

SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025

Qiqi Gu, Chenpeng Wu, Heng Shi, and Jianguo Yao. SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025. arXiv:2506.22035

work page arXiv 2025
[8]

Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613, 2018

2018
[9]

Microbenchmarking NVIDIA’s Blackwell architecture: An in-depth architectural analysis, 2025

Aaron Jarmusch and Sunita Chandrasekaran. Microbenchmarking NVIDIA’s Blackwell architecture: An in-depth architectural analysis, 2025

2025
[10]

Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W

Bálint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W. Lee, Pradeep Dubey, and William Watson. Lattice QCD on Intel Xeon Phi coprocessors.ISC High Performance, 2013

2013
[11]

Pracniques: Further remarks on reducing truncation errors.Communica- tions of the ACM, 8(1):40, 1965

William Kahan. Pracniques: Further remarks on reducing truncation errors.Communica- tions of the ACM, 8(1):40, 1965

1965
[12]

Knuth.The Art of Computer Programming, Volume 2: Seminumerical Algo- rithms

Donald E. Knuth.The Art of Computer Programming, Volume 2: Seminumerical Algo- rithms. Addison-Wesley, 3rd edition, 1997

1997
[13]

U. W. Kulisch. Mathematical foundation of computer arithmetic.IEEE Transactions on Computers, C-26(7):610–621, 1977

1977
[14]

Kulischand W.L

U.W. Kulischand W.L. Miranker. Thearithmeticof thedigital computer: Anew approach. SIAM Review, 28(1):1–40, 1986

1986
[15]

Toward accelerated stencil computation by adapting tensor core unit on GPU

Xueying Liu et al. Toward accelerated stencil computation by adapting tensor core unit on GPU. InProceedings of the 36th ACM International Conference on Supercomputing, 2022

2022
[16]

Lockwood

Glenn K. Lockwood. NVIDIA Rubin: Architecture notes and performance specifications. Glenn’s Digital Garden, 2026.https://www.glennklockwood.com/garden/processors/ R200, accessed May 2026

2026
[17]

Nvidia unpacks Vera Rubin rack system at CES

Tobias Mann. Nvidia unpacks Vera Rubin rack system at CES. The Register, January 2026. Documents Rubin FP64 vector regression from45to33TFLOPS. 25

2026
[18]

Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. NVIDIAtensorcoreprogrammability, performance&precision. In2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531, 2018

2018
[19]

FP8 is all you need (part 2): Efficient Ozaki–Bailey style FFT through tensor-core Garner reformulation and Kulisch escape route

Satoshi Matsuoka. FP8 is all you need (part 2): Efficient Ozaki–Bailey style FFT through tensor-core Garner reformulation and Kulisch escape route. Companion manuscript, 2026. In preparation

2026
[20]

DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025

Daichi Mukunoki. DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025

2025
[21]

DGEMMusing tensor cores, and its accurate and reproducible versions

DaichiMukunoki, KatsuhisaOzaki, TakeshiOgita, andToshiyukiImamura. DGEMMusing tensor cores, and its accurate and reproducible versions. InHigh Performance Computing – ISC High Performance 2020, pages 230–248. Springer, 2020

2020
[22]

Doudna: NERSC’s next-generation supercomputer based on NVIDIA Vera Rubin

NERSC. Doudna: NERSC’s next-generation supercomputer based on NVIDIA Vera Rubin. National Energy Research Scientific Computing Center, 2026

2026
[23]

NVIDIA Corporation.NVIDIA Blackwell Architecture Technical Brief, 2024

2024
[24]

Individual Blackwell Ultra GPU specifications, October 2025

NVIDIA Corporation.NVIDIA Blackwell Ultra GPU Datasheet, 2025. Individual Blackwell Ultra GPU specifications, October 2025

2025
[25]

NVIDIA HPC-Benchmarks 25.04: Ozaki-I HPL on blackwell tensor cores

NVIDIA Corporation. NVIDIA HPC-Benchmarks 25.04: Ozaki-I HPL on blackwell tensor cores. NVIDIA Developer Documentation, 2025

2025
[26]

Unlocking tensor core performance with floating-point emulation in cuBLAS

NVIDIA Corporation. Unlocking tensor core performance with floating-point emulation in cuBLAS. NVIDIA Developer Blog, 2025

2025
[27]

Blue Lion supercomputer will run on NVIDIA Vera Rubin

NVIDIA Corporation. Blue Lion supercomputer will run on NVIDIA Vera Rubin. NVIDIA Blog, 2026

2026
[28]

Emulated DGEMM

NVIDIA Corporation. Inside the NVIDIA Vera Rubin platform: Six new chips, one AI supercomputer. NVIDIA Developer Blog, 2026. Lists “Emulated DGEMM” as an official column in Rubin specifications

2026
[29]

DGEMM on integer matrix multi- plication unit

Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota. DGEMM on integer matrix multi- plication unit. The International Journal of High Performance Computing Applications, 2024

2024
[30]

Katsuhisa Ozaki, Takeshi Ogita, Shin’ichi Oishi, and Siegfried M. Rump. Error-free trans- formations of matrix multiplication by using fast routines of matrix multiplication and its applications.Numerical Algorithms, 59(1):95–118, 2012

2012
[31]

Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular tech- nique, 2025

Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura. Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular tech- nique, 2025

2025
[32]

Error analysis of matrix multipli- cation emulation using Ozaki-II scheme, 2026

Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura. Error analysis of matrix multipli- cation emulation using Ozaki-II scheme, 2026. Preprint

2026
[33]

Schwarz et al

A. Schwarz, A. Anders, C. Brower, H. Bayraktar, J. Gunnels, K. Clark, R. G. Xu, S. Ro- driguez, S. Cayrols, P. Tabaszewski, et al. Guaranteed DGEMM accuracy while using reduced precision tensor cores through extensions of the Ozaki scheme. InProceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific R...

work page arXiv 2025
[34]

The mixed-precision path to FP64 on AI-centric accelerators (MFP64 whitepaper)

Rick Stevens et al. The mixed-precision path to FP64 on AI-centric accelerators (MFP64 whitepaper). Technical report, U.S. Department of Energy, 2025

2025
[35]

GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II

Yuki Uchino. GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II. GitHub repository, RIKEN-RCCS, 2025

2025
[36]

Performance enhancement of the Ozaki scheme on integer matrix multiplication unit.The International Journal of High Performance Computing Applications, 2025

Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. Performance enhancement of the Ozaki scheme on integer matrix multiplication unit.The International Journal of High Performance Computing Applications, 2025

2025
[37]

Double-precision matrix multipli- cation emulation via Ozaki-II scheme with FP8 quantization, 2026

Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. Double-precision matrix multipli- cation emulation via Ozaki-II scheme with FP8 quantization, 2026

2026
[38]

SparStencil: Retargeting sparse tensor cores to scientific stencil compu- tations via structured sparsity transformation

Likun Wang et al. SparStencil: Retargeting sparse tensor cores to scientific stencil compu- tations via structured sparsity transformation. InProceedings of the International Confer- ence for High Performance Computing, Networking, Storage and Analysis (SC ’25), 2025. arXiv:2506.22969

work page arXiv 2025
[39]

Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

2009
[40]

NVIDIA says it’s not abandoning 64-bit computing

Alex Woodie. NVIDIA says it’s not abandoning 64-bit computing. HPCwire, December 2025

2025
[41]

AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms

Alex Woodie. AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms. HPCwire, March 2026

2026
[42]

stored” to “actual

Alex Woodie. Genesis mission will lean heavily on Ozaki scheme for FP64 capability. HPCwire, February 2026. A Garner’s Algorithm: Detailed Derivation Equation (7) is the iterative formulation of Garner’s algorithm. We expand it here for the case r= 3to make the dependence pattern explicit. LetC∈Zwith0≤C < m 1m2m3, and letC (i) =Cmodm i. We seek mixed-radi...

2026

[1] [1]

Amestoy, Iain S

Patrick R. Amestoy, Iain S. Duff, Jean-Yves L’Excellent, and Jacko Koster. A fully asyn- chronousmultifrontalsolverusingdistributeddynamicscheduling.SIAM Journal on Matrix 24 Analysis and Applications, 23(1):15–41, 2001

2001

[2] [2]

Higham, and Enrique S

Hartwig Anzt, Jack Dongarra, Goran Flegar, Nicholas J. Higham, and Enrique S. Quintana- Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers.Concurrency and Computation: Practice and Experience, 31(6):e4460, 2019

2019

[3] [3]

Booth, Alex J.˜W

George H. Booth, Alex J.˜W. Thom, and Ali Alavi. Fermion Monte Carlo without fixed nodes: A game of life, death, and annihilation in Slater determinant space.Journal of Chemical Physics, 131:054106, 2009

2009

[4] [4]

Clark, Ronald Babich, Kipton Barros, Richard C

Michael A. Clark, Ronald Babich, Kipton Barros, Richard C. Brower, and Claudio Rebbi. Solving lattice QCD systems of equations using mixed precision solvers on GPUs.Computer Physics Communications, 181(9):1517–1528, 2010

2010

[5] [5]

Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern

Nicolas Dickenmann. Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern. Blog post, 2026

2026

[6] [6]

Harvey L. Garner. The residue number system.IRE Transactions on Electronic Computers, EC-8(2):140–147, 1959

1959

[7] [7]

SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025

Qiqi Gu, Chenpeng Wu, Heng Shi, and Jianguo Yao. SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025. arXiv:2506.22035

work page arXiv 2025

[8] [8]

Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613, 2018

2018

[9] [9]

Microbenchmarking NVIDIA’s Blackwell architecture: An in-depth architectural analysis, 2025

Aaron Jarmusch and Sunita Chandrasekaran. Microbenchmarking NVIDIA’s Blackwell architecture: An in-depth architectural analysis, 2025

2025

[10] [10]

Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W

Bálint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W. Lee, Pradeep Dubey, and William Watson. Lattice QCD on Intel Xeon Phi coprocessors.ISC High Performance, 2013

2013

[11] [11]

Pracniques: Further remarks on reducing truncation errors.Communica- tions of the ACM, 8(1):40, 1965

William Kahan. Pracniques: Further remarks on reducing truncation errors.Communica- tions of the ACM, 8(1):40, 1965

1965

[12] [12]

Knuth.The Art of Computer Programming, Volume 2: Seminumerical Algo- rithms

Donald E. Knuth.The Art of Computer Programming, Volume 2: Seminumerical Algo- rithms. Addison-Wesley, 3rd edition, 1997

1997

[13] [13]

U. W. Kulisch. Mathematical foundation of computer arithmetic.IEEE Transactions on Computers, C-26(7):610–621, 1977

1977

[14] [14]

Kulischand W.L

U.W. Kulischand W.L. Miranker. Thearithmeticof thedigital computer: Anew approach. SIAM Review, 28(1):1–40, 1986

1986

[15] [15]

Toward accelerated stencil computation by adapting tensor core unit on GPU

Xueying Liu et al. Toward accelerated stencil computation by adapting tensor core unit on GPU. InProceedings of the 36th ACM International Conference on Supercomputing, 2022

2022

[16] [16]

Lockwood

Glenn K. Lockwood. NVIDIA Rubin: Architecture notes and performance specifications. Glenn’s Digital Garden, 2026.https://www.glennklockwood.com/garden/processors/ R200, accessed May 2026

2026

[17] [17]

Nvidia unpacks Vera Rubin rack system at CES

Tobias Mann. Nvidia unpacks Vera Rubin rack system at CES. The Register, January 2026. Documents Rubin FP64 vector regression from45to33TFLOPS. 25

2026

[18] [18]

Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. NVIDIAtensorcoreprogrammability, performance&precision. In2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531, 2018

2018

[19] [19]

FP8 is all you need (part 2): Efficient Ozaki–Bailey style FFT through tensor-core Garner reformulation and Kulisch escape route

Satoshi Matsuoka. FP8 is all you need (part 2): Efficient Ozaki–Bailey style FFT through tensor-core Garner reformulation and Kulisch escape route. Companion manuscript, 2026. In preparation

2026

[20] [20]

DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025

Daichi Mukunoki. DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025

2025

[21] [21]

DGEMMusing tensor cores, and its accurate and reproducible versions

DaichiMukunoki, KatsuhisaOzaki, TakeshiOgita, andToshiyukiImamura. DGEMMusing tensor cores, and its accurate and reproducible versions. InHigh Performance Computing – ISC High Performance 2020, pages 230–248. Springer, 2020

2020

[22] [22]

Doudna: NERSC’s next-generation supercomputer based on NVIDIA Vera Rubin

NERSC. Doudna: NERSC’s next-generation supercomputer based on NVIDIA Vera Rubin. National Energy Research Scientific Computing Center, 2026

2026

[23] [23]

NVIDIA Corporation.NVIDIA Blackwell Architecture Technical Brief, 2024

2024

[24] [24]

Individual Blackwell Ultra GPU specifications, October 2025

NVIDIA Corporation.NVIDIA Blackwell Ultra GPU Datasheet, 2025. Individual Blackwell Ultra GPU specifications, October 2025

2025

[25] [25]

NVIDIA HPC-Benchmarks 25.04: Ozaki-I HPL on blackwell tensor cores

NVIDIA Corporation. NVIDIA HPC-Benchmarks 25.04: Ozaki-I HPL on blackwell tensor cores. NVIDIA Developer Documentation, 2025

2025

[26] [26]

Unlocking tensor core performance with floating-point emulation in cuBLAS

NVIDIA Corporation. Unlocking tensor core performance with floating-point emulation in cuBLAS. NVIDIA Developer Blog, 2025

2025

[27] [27]

Blue Lion supercomputer will run on NVIDIA Vera Rubin

NVIDIA Corporation. Blue Lion supercomputer will run on NVIDIA Vera Rubin. NVIDIA Blog, 2026

2026

[28] [28]

Emulated DGEMM

NVIDIA Corporation. Inside the NVIDIA Vera Rubin platform: Six new chips, one AI supercomputer. NVIDIA Developer Blog, 2026. Lists “Emulated DGEMM” as an official column in Rubin specifications

2026

[29] [29]

DGEMM on integer matrix multi- plication unit

Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota. DGEMM on integer matrix multi- plication unit. The International Journal of High Performance Computing Applications, 2024

2024

[30] [30]

Katsuhisa Ozaki, Takeshi Ogita, Shin’ichi Oishi, and Siegfried M. Rump. Error-free trans- formations of matrix multiplication by using fast routines of matrix multiplication and its applications.Numerical Algorithms, 59(1):95–118, 2012

2012

[31] [31]

Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular tech- nique, 2025

Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura. Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular tech- nique, 2025

2025

[32] [32]

Error analysis of matrix multipli- cation emulation using Ozaki-II scheme, 2026

Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura. Error analysis of matrix multipli- cation emulation using Ozaki-II scheme, 2026. Preprint

2026

[33] [33]

Schwarz et al

A. Schwarz, A. Anders, C. Brower, H. Bayraktar, J. Gunnels, K. Clark, R. G. Xu, S. Ro- driguez, S. Cayrols, P. Tabaszewski, et al. Guaranteed DGEMM accuracy while using reduced precision tensor cores through extensions of the Ozaki scheme. InProceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific R...

work page arXiv 2025

[34] [34]

The mixed-precision path to FP64 on AI-centric accelerators (MFP64 whitepaper)

Rick Stevens et al. The mixed-precision path to FP64 on AI-centric accelerators (MFP64 whitepaper). Technical report, U.S. Department of Energy, 2025

2025

[35] [35]

GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II

Yuki Uchino. GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II. GitHub repository, RIKEN-RCCS, 2025

2025

[36] [36]

Performance enhancement of the Ozaki scheme on integer matrix multiplication unit.The International Journal of High Performance Computing Applications, 2025

Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. Performance enhancement of the Ozaki scheme on integer matrix multiplication unit.The International Journal of High Performance Computing Applications, 2025

2025

[37] [37]

Double-precision matrix multipli- cation emulation via Ozaki-II scheme with FP8 quantization, 2026

Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. Double-precision matrix multipli- cation emulation via Ozaki-II scheme with FP8 quantization, 2026

2026

[38] [38]

SparStencil: Retargeting sparse tensor cores to scientific stencil compu- tations via structured sparsity transformation

Likun Wang et al. SparStencil: Retargeting sparse tensor cores to scientific stencil compu- tations via structured sparsity transformation. InProceedings of the International Confer- ence for High Performance Computing, Networking, Storage and Analysis (SC ’25), 2025. arXiv:2506.22969

work page arXiv 2025

[39] [39]

Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

2009

[40] [40]

NVIDIA says it’s not abandoning 64-bit computing

Alex Woodie. NVIDIA says it’s not abandoning 64-bit computing. HPCwire, December 2025

2025

[41] [41]

AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms

Alex Woodie. AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms. HPCwire, March 2026

2026

[42] [42]

stored” to “actual

Alex Woodie. Genesis mission will lean heavily on Ozaki scheme for FP64 capability. HPCwire, February 2026. A Garner’s Algorithm: Detailed Derivation Equation (7) is the iterative formulation of Garner’s algorithm. We expand it here for the case r= 3to make the dependence pattern explicit. LetC∈Zwith0≤C < m 1m2m3, and letC (i) =Cmodm i. We seek mixed-radi...

2026