pith. sign in

arxiv: 2606.06510 · v1 · pith:EX2AKXJWnew · submitted 2026-05-28 · 💻 cs.AR · cs.AI· cs.DC· cs.PF

FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail

Pith reviewed 2026-06-29 00:41 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.PF
keywords FP8FP64 emulationOzaki Scheme IIHPC kernelsGPU architectureBlackwelltensor coresChinese Remainder Theorem
0
0 comments X

The pith

FP8 tensor throughput with Ozaki Scheme II recovers full FP64 accuracy at memory-roof speeds on B300 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that native hardware FP64 is required for high-performance scientific computing. It demonstrates that on B300-generation AI GPUs, where native FP64 falls to 1.3 TFLOPS, combining abundant FP8 tensor cores with the Chinese Remainder Theorem-based Ozaki Scheme II enables emulation that reaches the memory roof while preserving full double-precision accuracy for kernels such as SpMV, GEMV, and stencils. A Tensor-Memory Equilibrium model incorporating compute multiplier alpha, bandwidth multiplier beta, and reconstruction latency gamma shows that register-level fusion drives emulation overhead to negligible levels. Projections indicate effective FP64 performance of approximately 500 TFLOPS on B300 and 400 TFLOPS on Rubin R200, exceeding both native B300 capability and prior H100 native FP64 results across the studied workloads.

Core claim

On AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum, projecting effective FP64 rates of 500 TFLOPS on B300 while matching memory bandwidth limits in bandwidth-bound cases.

What carries the argument

Ozaki Scheme II, a Chinese Remainder Theorem-based reconstruction method that emulates FP64 using FP8 tensor operations, with register-level fusion to achieve beta approaching 1 in the TME model.

If this is right

  • B300 native FP64 at 1.3 TFLOPS is replaced by emulated rates up to 500 TFLOPS in compute-bound regimes.
  • Performance matches or exceeds H100 native FP64 on all workloads studied.
  • Combined with Kulisch fixed-point methods for FFTs, every kernel class reaches the memory roof at full FP64.
  • Native FP64 silicon is no longer required for production HPC workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future GPU designs may allocate more silicon to tensor units rather than dedicated FP64 pipelines.
  • The same emulation approach could be tested on other low-precision formats or non-GPU accelerators.
  • Real hardware runs would be required to validate whether the TME model parameters hold under production conditions.

Load-bearing premise

The TME model parameters guarantee that reconstruction overhead remains negligible and full FP64 accuracy holds for every surveyed kernel.

What would settle it

Measure actual throughput and accuracy of Ozaki II on B300 hardware for kernels such as GEMM and SpMV to check whether full FP64 results are obtained near the projected 500 TFLOPS or memory roof.

Figures

Figures reproduced from arXiv: 2606.06510 by Satoshi Matsuoka.

Figure 1
Figure 1. Figure 1: TME roofline projection for B300 and Rubin. Each platform has its own memory roof [PITH_FULL_IMAGE:figures/full_fig_p012_1.png] view at source ↗
read the original abstract

Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel spectrum. NVIDIA's Blackwell Ultra (B300) collapses native FP64 to ~1.3 TFLOPS -- a 31x regression from the B200 -- rendering even memory-bound kernels (SpMV, GEMV, stencils) compute-bound. We make four contributions. First, a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. Second, we identify register-level fusion as the mechanism driving beta -> 1, making emulation essentially free behind the memory wall. Third, we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200), exceeding even B200's native FP64 ceiling by over an order of magnitude in the compute-bound regime while matching the memory roof in the bandwidth-bound regime. Fourth, against an H100 baseline, Ozaki II matches or exceeds H100 on every workload studied, versus the up-to-50x regression that B300 native FP64 imposes. Combined with a companion FFT analysis (Kulisch fixed-point reconstruction on the surviving INT32 pipe) and FP32+Kahan reductions reported in the companion Part(2) paper, every surveyed kernel class on B300 reaches the memory roof at full FP64. The evidence supports the title's claim: FP8, with Ozaki II and Kulisch escape routes, is all one needs for production HPC; native FP64 silicon is no longer the holy grail it has been taken to be.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that on AI-optimized GPUs such as NVIDIA Blackwell Ultra (B300), abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II can recover memory-roof execution at full FP64 accuracy for canonical HPC kernels including SpMV, GEMV, and stencils. It introduces the Tensor-Memory Equilibrium (TME) analytic model augmenting Roofline with multipliers alpha (compute), beta (bandwidth), and gamma (reconstruction latency), asserts that register-level fusion drives beta to 1, and projects ~500 TFLOPS effective FP64 on B300 (exceeding B200 native FP64) while matching or exceeding H100 baselines, concluding that native FP64 silicon is no longer required.

Significance. If the TME projections and accuracy claims hold with supporting analysis, the result would challenge the necessity of dedicated FP64 hardware in future HPC systems and shift design priorities toward lower-precision tensor units with emulation techniques, with potential impact on both architecture and software stacks for scientific computing.

major comments (3)
  1. [Abstract] Abstract: the TME model defines free parameters alpha, beta, and gamma (with beta asserted to reach exactly 1 via register-level fusion) but supplies no register-pressure analysis, instruction schedule, or first-principles derivation showing that multiple FP8 GEMMs plus CRT reconstruction fit without spilling or extra memory traffic for the surveyed kernels; the ~500 TFLOPS projection on B300 is obtained by direct multiplication of the native FP8 rate by these unverified factors.
  2. [Abstract] Abstract: the claim of 'full FP64 accuracy' for Ozaki Scheme II across the kernel spectrum is stated without cross-validation against native FP64 references or explicit bounds on rounding/overflow in the reconstruction steps, yet this premise is required for the central assertion that emulation recovers exact memory-roof execution.
  3. [Abstract] Abstract: the performance comparison to H100 and the claim that Ozaki II 'matches or exceeds H100 on every workload' while native B300 FP64 imposes up to 50x regression rest on the same unverified TME parameters; no sensitivity analysis or alternative parameter values are provided to test robustness.
minor comments (1)
  1. [Abstract] The abstract introduces the TME model and Ozaki Scheme II without a forward reference to the section containing their formal definitions or the precise equations for alpha, beta, and gamma.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the presentation of the TME model and the supporting claims. We address each major comment below and will incorporate additional analysis to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the TME model defines free parameters alpha, beta, and gamma (with beta asserted to reach exactly 1 via register-level fusion) but supplies no register-pressure analysis, instruction schedule, or first-principles derivation showing that multiple FP8 GEMMs plus CRT reconstruction fit without spilling or extra memory traffic for the surveyed kernels; the ~500 TFLOPS projection on B300 is obtained by direct multiplication of the native FP8 rate by these unverified factors.

    Authors: We agree that the abstract does not contain the requested register-pressure analysis or instruction schedule. The TME model in the manuscript body parameterizes beta from the architectural register file size and the number of intermediate values generated by Ozaki II, but a detailed first-principles derivation confirming no register spilling for SpMV, GEMV, and stencils is absent. We will add this analysis (including register usage estimates and a high-level schedule) in a new subsection or appendix of the revised manuscript to verify that beta reaches 1 without extra memory traffic. revision: yes

  2. Referee: [Abstract] Abstract: the claim of 'full FP64 accuracy' for Ozaki Scheme II across the kernel spectrum is stated without cross-validation against native FP64 references or explicit bounds on rounding/overflow in the reconstruction steps, yet this premise is required for the central assertion that emulation recovers exact memory-roof execution.

    Authors: The manuscript asserts full FP64 accuracy on the basis of the exact arithmetic properties of CRT reconstruction within the representable range. However, the abstract and main text do not include explicit cross-validation against native FP64 or derived bounds on reconstruction overflow for the target kernels. We will expand the accuracy section with direct numerical comparisons and explicit overflow bounds in the revision. revision: yes

  3. Referee: [Abstract] Abstract: the performance comparison to H100 and the claim that Ozaki II 'matches or exceeds H100 on every workload' while native B300 FP64 imposes up to 50x regression rest on the same unverified TME parameters; no sensitivity analysis or alternative parameter values are provided to test robustness.

    Authors: The H100 comparisons and the 50x regression claim are obtained by substituting the B300 and H100 architectural parameters into the TME equations. We acknowledge the absence of sensitivity analysis on alpha, beta, and gamma. The revised manuscript will include a sensitivity study that varies these parameters over plausible ranges and reports the resulting performance envelopes, thereby testing robustness of the workload comparisons. revision: yes

Circularity Check

1 steps flagged

TME model projections of ~500 TFLOPS reduce to assumed multipliers alpha/beta/gamma by construction

specific steps
  1. fitted input called prediction [Abstract]
    "a unified analytic model, the Tensor-Memory Equilibrium (TME) model, augmenting the Roofline with a compute multiplier alpha, a bandwidth multiplier beta, and a reconstruction latency gamma. ... we project that Ozaki II vaults emulated FP64 from the ~1 TFLOPS native floor to ~500 TFLOPS (B300) and ~400 TFLOPS (Rubin R200)"

    The ~500 TFLOPS figure is obtained by plugging the TME parameters (particularly beta set to 1 via the register-fusion claim) into the model; the projection is therefore the direct numerical output of those assumed multipliers rather than an independent result.

full rationale

The abstract presents the TME model (with alpha, beta, gamma) as the basis for projecting emulated FP64 performance to ~500 TFLOPS on B300, asserting beta->1 via register-level fusion. This matches the fitted_input_called_prediction pattern because the headline performance numbers are generated directly from the model's own parameters rather than from independent derivation, external benchmarks, or first-principles register analysis shown in the text. No other circular patterns (self-citation chains, self-definitional equations, or ansatz smuggling) are exhibited in the provided sections. The central claim therefore has partial circularity but retains some independent modeling content.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entities

The central claim depends on the TME model parameters and the assumption that Ozaki II reconstruction preserves full accuracy while incurring negligible latency; these are introduced without external validation in the abstract.

free parameters (3)
  • alpha
    compute multiplier augmenting Roofline in TME model
  • beta
    bandwidth multiplier in TME model
  • gamma
    reconstruction latency term in TME model
axioms (2)
  • standard math Chinese Remainder Theorem enables exact FP64 reconstruction from FP8 operations
    Invoked to claim full accuracy in Ozaki Scheme II
  • domain assumption Register-level fusion drives beta to 1 making emulation free behind the memory wall
    Stated as the mechanism that eliminates overhead in the abstract
invented entities (1)
  • Tensor-Memory Equilibrium (TME) model no independent evidence
    purpose: Augmented Roofline model for emulated FP64 performance
    New analytic framework introduced to support the projections

pith-pipeline@v0.9.1-grok · 5927 in / 1704 out tokens · 35120 ms · 2026-06-29T00:41:43.713508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 3 canonical work pages

  1. [1]

    Amestoy, Iain S

    Patrick R. Amestoy, Iain S. Duff, Jean-Yves L’Excellent, and Jacko Koster. A fully asyn- chronousmultifrontalsolverusingdistributeddynamicscheduling.SIAM Journal on Matrix 24 Analysis and Applications, 23(1):15–41, 2001

  2. [2]

    Higham, and Enrique S

    Hartwig Anzt, Jack Dongarra, Goran Flegar, Nicholas J. Higham, and Enrique S. Quintana- Ortí. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers.Concurrency and Computation: Practice and Experience, 31(6):e4460, 2019

  3. [3]

    Booth, Alex J.˜W

    George H. Booth, Alex J.˜W. Thom, and Ali Alavi. Fermion Monte Carlo without fixed nodes: A game of life, death, and annihilation in Slater determinant space.Journal of Chemical Physics, 131:054106, 2009

  4. [4]

    Clark, Ronald Babich, Kipton Barros, Richard C

    Michael A. Clark, Ronald Babich, Kipton Barros, Richard C. Brower, and Claudio Rebbi. Solving lattice QCD systems of equations using mixed precision solvers on GPUs.Computer Physics Communications, 181(9):1517–1528, 2010

  5. [5]

    Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern

    Nicolas Dickenmann. Fifteen years of FP64 segmentation, and why the Blackwell ultra breaks the pattern. Blog post, 2026

  6. [6]

    Harvey L. Garner. The residue number system.IRE Transactions on Electronic Computers, EC-8(2):140–147, 1959

  7. [7]

    SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025

    Qiqi Gu, Chenpeng Wu, Heng Shi, and Jianguo Yao. SPTCStencil: Unleashing sparse tensor cores for stencil computation via strided swap, 2025. arXiv:2506.22035

  8. [8]

    Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613, 2018

  9. [9]

    Microbenchmarking NVIDIA’s Blackwell architecture: An in-depth architectural analysis, 2025

    Aaron Jarmusch and Sunita Chandrasekaran. Microbenchmarking NVIDIA’s Blackwell architecture: An in-depth architectural analysis, 2025

  10. [10]

    Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W

    Bálint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Karthikeyan Pamnany, Victor W. Lee, Pradeep Dubey, and William Watson. Lattice QCD on Intel Xeon Phi coprocessors.ISC High Performance, 2013

  11. [11]

    Pracniques: Further remarks on reducing truncation errors.Communica- tions of the ACM, 8(1):40, 1965

    William Kahan. Pracniques: Further remarks on reducing truncation errors.Communica- tions of the ACM, 8(1):40, 1965

  12. [12]

    Knuth.The Art of Computer Programming, Volume 2: Seminumerical Algo- rithms

    Donald E. Knuth.The Art of Computer Programming, Volume 2: Seminumerical Algo- rithms. Addison-Wesley, 3rd edition, 1997

  13. [13]

    U. W. Kulisch. Mathematical foundation of computer arithmetic.IEEE Transactions on Computers, C-26(7):610–621, 1977

  14. [14]

    Kulischand W.L

    U.W. Kulischand W.L. Miranker. Thearithmeticof thedigital computer: Anew approach. SIAM Review, 28(1):1–40, 1986

  15. [15]

    Toward accelerated stencil computation by adapting tensor core unit on GPU

    Xueying Liu et al. Toward accelerated stencil computation by adapting tensor core unit on GPU. InProceedings of the 36th ACM International Conference on Supercomputing, 2022

  16. [16]

    Lockwood

    Glenn K. Lockwood. NVIDIA Rubin: Architecture notes and performance specifications. Glenn’s Digital Garden, 2026.https://www.glennklockwood.com/garden/processors/ R200, accessed May 2026

  17. [17]

    Nvidia unpacks Vera Rubin rack system at CES

    Tobias Mann. Nvidia unpacks Vera Rubin rack system at CES. The Register, January 2026. Documents Rubin FP64 vector regression from45to33TFLOPS. 25

  18. [18]

    Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. NVIDIAtensorcoreprogrammability, performance&precision. In2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531, 2018

  19. [19]

    FP8 is all you need (part 2): Efficient Ozaki–Bailey style FFT through tensor-core Garner reformulation and Kulisch escape route

    Satoshi Matsuoka. FP8 is all you need (part 2): Efficient Ozaki–Bailey style FFT through tensor-core Garner reformulation and Kulisch escape route. Companion manuscript, 2026. In preparation

  20. [20]

    DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025

    Daichi Mukunoki. DGEMM without FP64 arithmetic: Using FP64 emulation and FP8 tensor cores with Ozaki scheme, 2025

  21. [21]

    DGEMMusing tensor cores, and its accurate and reproducible versions

    DaichiMukunoki, KatsuhisaOzaki, TakeshiOgita, andToshiyukiImamura. DGEMMusing tensor cores, and its accurate and reproducible versions. InHigh Performance Computing – ISC High Performance 2020, pages 230–248. Springer, 2020

  22. [22]

    Doudna: NERSC’s next-generation supercomputer based on NVIDIA Vera Rubin

    NERSC. Doudna: NERSC’s next-generation supercomputer based on NVIDIA Vera Rubin. National Energy Research Scientific Computing Center, 2026

  23. [23]

    NVIDIA Corporation.NVIDIA Blackwell Architecture Technical Brief, 2024

  24. [24]

    Individual Blackwell Ultra GPU specifications, October 2025

    NVIDIA Corporation.NVIDIA Blackwell Ultra GPU Datasheet, 2025. Individual Blackwell Ultra GPU specifications, October 2025

  25. [25]

    NVIDIA HPC-Benchmarks 25.04: Ozaki-I HPL on blackwell tensor cores

    NVIDIA Corporation. NVIDIA HPC-Benchmarks 25.04: Ozaki-I HPL on blackwell tensor cores. NVIDIA Developer Documentation, 2025

  26. [26]

    Unlocking tensor core performance with floating-point emulation in cuBLAS

    NVIDIA Corporation. Unlocking tensor core performance with floating-point emulation in cuBLAS. NVIDIA Developer Blog, 2025

  27. [27]

    Blue Lion supercomputer will run on NVIDIA Vera Rubin

    NVIDIA Corporation. Blue Lion supercomputer will run on NVIDIA Vera Rubin. NVIDIA Blog, 2026

  28. [28]

    Emulated DGEMM

    NVIDIA Corporation. Inside the NVIDIA Vera Rubin platform: Six new chips, one AI supercomputer. NVIDIA Developer Blog, 2026. Lists “Emulated DGEMM” as an official column in Rubin specifications

  29. [29]

    DGEMM on integer matrix multi- plication unit

    Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota. DGEMM on integer matrix multi- plication unit. The International Journal of High Performance Computing Applications, 2024

  30. [30]

    Katsuhisa Ozaki, Takeshi Ogita, Shin’ichi Oishi, and Siegfried M. Rump. Error-free trans- formations of matrix multiplication by using fast routines of matrix multiplication and its applications.Numerical Algorithms, 59(1):95–118, 2012

  31. [31]

    Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular tech- nique, 2025

    Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura. Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular tech- nique, 2025

  32. [32]

    Error analysis of matrix multipli- cation emulation using Ozaki-II scheme, 2026

    Katsuhisa Ozaki, Yuki Uchino, and Toshiyuki Imamura. Error analysis of matrix multipli- cation emulation using Ozaki-II scheme, 2026. Preprint

  33. [33]

    Schwarz et al

    A. Schwarz, A. Anders, C. Brower, H. Bayraktar, J. Gunnels, K. Clark, R. G. Xu, S. Ro- driguez, S. Cayrols, P. Tabaszewski, et al. Guaranteed DGEMM accuracy while using reduced precision tensor cores through extensions of the Ozaki scheme. InProceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific R...

  34. [34]

    The mixed-precision path to FP64 on AI-centric accelerators (MFP64 whitepaper)

    Rick Stevens et al. The mixed-precision path to FP64 on AI-centric accelerators (MFP64 whitepaper). Technical report, U.S. Department of Energy, 2025

  35. [35]

    GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II

    Yuki Uchino. GEMMul8: GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II. GitHub repository, RIKEN-RCCS, 2025

  36. [36]

    Performance enhancement of the Ozaki scheme on integer matrix multiplication unit.The International Journal of High Performance Computing Applications, 2025

    Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. Performance enhancement of the Ozaki scheme on integer matrix multiplication unit.The International Journal of High Performance Computing Applications, 2025

  37. [37]

    Double-precision matrix multipli- cation emulation via Ozaki-II scheme with FP8 quantization, 2026

    Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. Double-precision matrix multipli- cation emulation via Ozaki-II scheme with FP8 quantization, 2026

  38. [38]

    SparStencil: Retargeting sparse tensor cores to scientific stencil compu- tations via structured sparsity transformation

    Likun Wang et al. SparStencil: Retargeting sparse tensor cores to scientific stencil compu- tations via structured sparsity transformation. InProceedings of the International Confer- ence for High Performance Computing, Networking, Storage and Analysis (SC ’25), 2025. arXiv:2506.22969

  39. [39]

    Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

  40. [40]

    NVIDIA says it’s not abandoning 64-bit computing

    Alex Woodie. NVIDIA says it’s not abandoning 64-bit computing. HPCwire, December 2025

  41. [41]

    AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms

    Alex Woodie. AMD hints at big FP64 increases in MI430X GPU as Ozaki underwhelms. HPCwire, March 2026

  42. [42]

    stored” to “actual

    Alex Woodie. Genesis mission will lean heavily on Ozaki scheme for FP64 capability. HPCwire, February 2026. A Garner’s Algorithm: Detailed Derivation Equation (7) is the iterative formulation of Garner’s algorithm. We expand it here for the case r= 3to make the dependence pattern explicit. LetC∈Zwith0≤C < m 1m2m3, and letC (i) =Cmodm i. We seek mixed-radi...