arxiv: 2603.20421 · v2 · pith:5RZ4XEQYnew · submitted 2026-03-20 · 💻 cs.CR · cs.AR · cs.LG · cs.NA · math.NA

Hawkeye: Reproducing GPU-Level Non-Determinism

Pith reviewed 2026-05-21 09:49 UTC · model grok-4.3

classification: cs.CR, cs.AR, cs.LG, cs.NA, math.NA
keywords: GPU arithmetic reproduction, Tensor Core simulation, matrix multiplication, verifiable machine learning, non-determinism, rounding and accumulation, CPU-GPU equivalence, ML auditing

The pith

Hawkeye lets anyone replay exact NVIDIA Tensor Core matrix multiplications on a CPU with no precision loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a carefully designed set of tests can fully map how NVIDIA GPUs handle rounding, subnormal numbers, and non-associative accumulation during matrix multiplications on Tensor Cores. Once these behaviors are captured, the same operations can be executed on a standard CPU and produce identical results to the original GPU run. This matters for verifiable machine learning because it removes the need for expensive overhead on the model owner or for approximations that degrade quality. The approach works across multiple GPU generations and common low-precision formats used in training and inference. By making GPU arithmetic deterministic and portable, it supports trustworthy third-party checks on ML workflows.

Core claim

Hawkeye uses a systematic sequence of tests to characterize rounding direction, subnormal number handling, and the order of non-associative accumulation in matrix multiplication on NVIDIA Tensor Cores. These characterizations enable exact reproduction of the GPU computations on a CPU without any precision loss, for the tested architectures and precisions.
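
Why accumulation order matters can be seen in a few lines. The sketch below is not taken from the paper: it emulates FP16 addition in plain Python by rounding each intermediate sum to binary16 with the standard `struct` module (round to nearest, ties to even), and the helper names `fp16` and `add16` are illustrative.

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def add16(a: float, b: float) -> float:
    """Emulated FP16 addition: exact sum in double, then one rounding to binary16."""
    return fp16(a + b)

a, b, c = fp16(2048.0), fp16(1.0), fp16(1.0)

# In [2048, 4096) adjacent FP16 values are 2 apart, so 2048 + 1 is a tie
# that rounds back to 2048 (ties to even).
left = add16(add16(a, b), c)    # (a + b) + c
right = add16(a, add16(b, c))   # a + (b + c)

print(left, right)  # 2048.0 2050.0
```

The two groupings disagree by one FP16 step, which is why a CPU replay has to follow the same summation order as the hardware to match it bit for bit.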

What carries the argument

A systematic sequence of tests that isolate and record rounding direction, subnormal handling, and accumulation order during Tensor Core matrix multiplication.
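
As an illustration of what a test that isolates rounding direction could look like, the sketch below picks an exact value that lies strictly between two adjacent FP16 numbers and checks which neighbour an observed result equals. This is an assumed probe, not the paper's test suite; `round_toward_zero_fp16` and `classify_rounding` are hypothetical helpers, and the observed value would come from real hardware rather than from the placeholder used here.

```python
import struct

def round_nearest_fp16(x: float) -> float:
    """Round to the nearest binary16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_toward_zero_fp16(x: float) -> float:
    """Round to the binary16 value closest to x that is not larger in magnitude."""
    r = round_nearest_fp16(x)
    if abs(r) > abs(x):
        # binary16 is sign-magnitude, so decrementing the bit pattern of a
        # nonzero finite value moves it one step toward zero.
        bits = struct.unpack("<H", struct.pack("<e", r))[0]
        r = struct.unpack("<e", struct.pack("<H", bits - 1))[0]
    return r

def classify_rounding(observed: float, exact: float) -> str:
    """Say which rounding rule is consistent with an observed result."""
    rn, rz = round_nearest_fp16(exact), round_toward_zero_fp16(exact)
    if rn == rz:
        return "ambiguous"  # exact value is representable; the probe is uninformative
    if observed == rn:
        return "nearest"
    if observed == rz:
        return "toward-zero"
    return "other"

exact = 2049.5     # lies between the FP16 neighbours 2048 and 2050
observed = 2048.0  # placeholder standing in for a value read back from a device
print(classify_rounding(observed, exact))  # toward-zero
```

A single probe only separates the two rules it was built for; distinguishing further modes (for example, rounding up versus toward zero) needs additional inputs of both signs.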

If this is right

  • Third-party auditors can verify ML training or inference steps without adding overhead to the original model owner.
  • Reproduction works across Ampere, Hopper, and Lovelace GPUs for FP16, BF16, and FP8 formats.
  • Prior verifiable-ML methods that either cost extra compute or lose accuracy can be avoided.
  • The framework supports auditing of both training and inference workflows that rely on these matrix operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same test-driven approach could be extended to other GPU arithmetic primitives beyond matrix multiplication.
  • Exact CPU replay might simplify debugging of floating-point discrepancies in mixed CPU-GPU ML pipelines.
  • If the characterization holds for newer architectures, it could become a standard tool for reproducible ML deployments.

Load-bearing premise

The tests capture every relevant detail of rounding, subnormals, and accumulation order for the GPU architectures and precisions examined.
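
To make the subnormal part of this premise concrete, the sketch below shows one FP16 product whose result depends on whether subnormals are kept (gradual underflow) or flushed to zero. Which policy a given Tensor Core path follows is exactly what the paper's tests are meant to determine; the `flush_to_zero` helper is a hypothetical model used only for contrast.

```python
import math
import struct

FP16_MIN_NORMAL = 2.0 ** -14  # smallest positive normal binary16 value

def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def is_subnormal_fp16(x: float) -> bool:
    """True for nonzero values below the smallest normal FP16 magnitude."""
    return x != 0.0 and abs(x) < FP16_MIN_NORMAL

def flush_to_zero(x: float) -> float:
    """Hypothetical policy: replace subnormal results with a signed zero."""
    return math.copysign(0.0, x) if is_subnormal_fp16(x) else x

a, b = fp16(2.0 ** -10), fp16(2.0 ** -12)

gradual = fp16(a * b)             # 2**-22, kept as a subnormal
flushed = flush_to_zero(gradual)  # 0.0 under the flush-to-zero model

print(gradual, flushed)
```

An input like this one separates the two policies in a single multiplication, so a replay that guesses the wrong policy would diverge as soon as a product or partial sum lands in the subnormal range.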

What would settle it

A matrix-multiplication input on one of the tested architectures and precisions that produces a different result on the CPU simulator than on the real GPU after applying the recorded behaviors.
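
Checking for such a counterexample means comparing outputs bit for bit rather than with a tolerance. A minimal sketch, assuming both results are available as lists of FP32 values (the function names and the sample values are illustrative, not from the paper's repository):

```python
import struct

def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def first_mismatch(gpu_out, cpu_out):
    """Index of the first element whose bit patterns differ, or None if identical."""
    if len(gpu_out) != len(cpu_out):
        raise ValueError("outputs must have the same length")
    for i, (g, c) in enumerate(zip(gpu_out, cpu_out)):
        if f32_bits(g) != f32_bits(c):
            return i
    return None

# Placeholder values standing in for a recorded GPU result and a CPU replay.
gpu_out = [1.0, 0.0, 3.5]
cpu_out = [1.0, -0.0, 3.5]

print(gpu_out == cpu_out)                # True: == treats 0.0 and -0.0 as equal
print(first_mismatch(gpu_out, cpu_out))  # 1: the bit patterns differ in the sign
```

Comparing bit patterns also distinguishes NaN payloads and signed zeros, which ordinary floating-point equality hides.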

Figures

Figures reproduced from arXiv: 2603.20421 by Dan Boneh, Erez Badash, Ilan Komargodski, Megha Srivastava.

Figure 1: Crucially, there is no evidence of dynamic sorting or reordering within the summation mechanism, indicating a straightforward and deterministic accumulation strategy implemented in the hardware.
[Diagram labels: products A_{i,1}B_{1,j}, ..., A_{i,8}B_{8,j} feed a sum together with C_{i,j}; products A_{i,9}B_{9,j}, ..., A_{i,16}B_{16,j} feed a second sum that outputs D_{i,j}.]
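
The grouping suggested by the Figure 1 labels (eight products per summation stage) can be sketched on a CPU as follows. This is an illustrative model only: the single FP32 rounding after each group and the use of `math.fsum` for the in-group sum are assumptions, not behaviours confirmed by the text above, and `grouped_dot` and `sequential_dot` are hypothetical names.

```python
import math
import struct

def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def grouped_dot(a, b, c, group=8):
    """Add products in blocks of `group`, rounding to FP32 once per block (assumed)."""
    acc = c
    for start in range(0, len(a), group):
        products = [x * y for x, y in zip(a[start:start + group], b[start:start + group])]
        # fsum gives a correctly rounded double; the FP32 rounding follows it.
        acc = f32(math.fsum([acc] + products))
    return acc

def sequential_dot(a, b, c):
    """Add products one at a time, rounding to FP32 after every addition."""
    acc = c
    for x, y in zip(a, b):
        acc = f32(acc + x * y)
    return acc

a = [1.0] * 16
b = [1.0] * 16
c = 2.0 ** 24  # adjacent FP32 values are 2 apart from here on

print(grouped_dot(a, b, c))     # 16777232.0
print(sequential_dot(a, b, c))  # 16777216.0
```

With these inputs every single `+1` is lost to rounding in the sequential model, while each block of eight survives in the grouped model, so the two orders give different FP32 results from the same data.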
Figure 2: Computational graph of Hopper tensor core accumulation in a pyramid structure. The initial accumulator C_{i,j} and the 16 products are summed into the final result.

7 Reproducing BF16 Workflow on Ampere
We now extend our investigation of the dot product behavior to the BF16 format. Initial experiments show that all previously observed behaviors in FP16, including summation order, accumulator integration, rou…
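
The pyramid structure named in the Figure 2 caption can be pictured as a pairwise (tree) reduction. The sketch below is a generic pairwise sum, not the paper's Hopper model: rounding to FP32 at every node and adding the accumulator last are assumptions made only for illustration, since the caption does not specify either.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def pyramid_sum(terms):
    """Pairwise reduction: sum neighbours level by level, rounding at each node."""
    level = [f32(t) for t in terms]
    while len(level) > 1:
        nxt = [f32(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired last element moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

def sequential_sum(terms):
    """Left-to-right reduction with an FP32 rounding after every addition."""
    acc = 0.0
    for t in terms:
        acc = f32(acc + t)
    return acc

products = [2.0 ** 24] + [1.0] * 15  # 16 placeholder products
c = 0.0                              # initial accumulator

d_pyramid = f32(c + pyramid_sum(products))
d_sequential = f32(c + sequential_sum(products))

print(d_pyramid, d_sequential)  # 16777230.0 16777216.0
```

The exact sum is 16777231, and the two reduction shapes land on different FP32 values, which illustrates why the tree shape itself has to be recovered before a replay can be exact.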
Original abstract

We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using our framework, anyone can re-execute on a CPU the exact matrix multiplication operations underlying a machine learning model training or inference workflow that was executed on an NVIDIA GPU, without any precision loss. This is in stark contrast to prior approaches to verifiable machine learning, which either introduce significant computation overhead to the original model owner, or suffer from non-robustness and quality degradation. The main technical contribution of Hawkeye is a systematic sequence of carefully crafted tests that study rounding direction, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication on NVIDIA's Tensor Cores. We test and evaluate our framework on multiple NVIDIA GPU architectures (Ampere, Hopper, and Lovelace) and precision types (FP16, BFP16, FP8). In all test cases, Hawkeye enables perfect reproduction of matrix multiplication on a CPU, paving the way for efficient and trustworthy third-party auditing of ML model training and inference. We provide source code for Hawkeye at https://github.com/badasherez/gpu-simulator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using a systematic sequence of tests on rounding direction, subnormal number handling, and non-associative accumulation order during matrix multiplication on NVIDIA Tensor Cores, the framework claims to enable exact, precision-loss-free re-execution on CPU of any matrix-multiplication operations from ML training or inference workflows run on tested NVIDIA GPUs and precisions.

Significance. If the central claim holds, the result would be significant for verifiable machine learning by allowing efficient third-party auditing without the overhead or quality degradation of prior approaches. The open-source release of the code at the provided GitHub link is a strength that supports reproducibility and adoption.

major comments (2)
  1. §5 Evaluation: The manuscript states that perfect reproduction was achieved in all test cases on Ampere/Hopper/Lovelace with FP16/BF16/FP8, but provides no quantitative metrics on test coverage such as the range of matrix dimensions, tile sizes, warp configurations, or subnormal-triggering inputs. This is load-bearing for the universal claim over 'any' ML workflow because Tensor Core accumulation order depends on these parameters.
  2. §3 Test Design: The description of the systematic test sequence does not demonstrate or argue that it exhausts all relevant accumulation trees and rounding behaviors across the supported precisions and architectures, leaving open the possibility of mismatches on untested GEMM shapes or fused kernels used in real models.
minor comments (2)
  1. The abstract lists 'BFP16' as a precision; this should be corrected to the standard 'BF16' notation for clarity.
  2. Consider adding a table in the evaluation section that enumerates the exact matrix sizes and launch parameters used in the tests to improve transparency.
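
One way to produce the kind of coverage table the second minor comment asks for is to enumerate the test grid programmatically. The sketch below is hypothetical: the sizes and input classes are placeholders chosen for illustration, not the configurations used in the paper.

```python
from itertools import product

# Placeholder grid; a real evaluation would list the shapes actually run.
SIZES = [8, 16, 64, 1024]
PRECISIONS = ["FP16", "BF16", "FP8"]
INPUT_CLASSES = ["normal", "subnormal-triggering"]

def coverage_rows(sizes=SIZES, precisions=PRECISIONS, input_classes=INPUT_CLASSES):
    """Return one (precision, M, N, K, input_class) row per test configuration."""
    rows = []
    for precision, (m, n, k), input_class in product(
        precisions, product(sizes, repeat=3), input_classes
    ):
        rows.append((precision, m, n, k, input_class))
    return rows

rows = coverage_rows()
print(f"{'precision':<10}{'M':>6}{'N':>6}{'K':>6}  input class")
for precision, m, n, k, input_class in rows[:4]:
    print(f"{precision:<10}{m:>6}{n:>6}{k:>6}  {input_class}")
print(f"... {len(rows)} configurations in total")
```

Listing the grid this way makes the coverage claim checkable: a reader can see which (M, N, K) shapes and input classes were exercised per precision and which were not.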

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to strengthen the presentation of our evaluation results and test design. We address each major comment below and commit to revisions that enhance clarity without altering the core technical contributions.

Point-by-point responses
  1. Referee: §5 Evaluation: The manuscript states that perfect reproduction was achieved in all test cases on Ampere/Hopper/Lovelace with FP16/BF16/FP8, but provides no quantitative metrics on test coverage such as the range of matrix dimensions, tile sizes, warp configurations, or subnormal-triggering inputs. This is load-bearing for the universal claim over 'any' ML workflow because Tensor Core accumulation order depends on these parameters.

    Authors: We agree that quantitative metrics on test coverage are essential to support the claim of applicability to arbitrary ML workflows. In the revised manuscript, we will expand §5 to include specific details on the tested matrix dimensions (ranging from 8×8 to 8192×8192), tile sizes, warp configurations, and inputs engineered to trigger subnormals across Ampere, Hopper, and Lovelace architectures with FP16, BF16, and FP8 precisions. These additions will directly address the dependence of accumulation order on such parameters. revision: yes

  2. Referee: §3 Test Design: The description of the systematic test sequence does not demonstrate or argue that it exhausts all relevant accumulation trees and rounding behaviors across the supported precisions and architectures, leaving open the possibility of mismatches on untested GEMM shapes or fused kernels used in real models.

    Authors: We acknowledge that §3 would benefit from an explicit argument for exhaustiveness. The systematic sequence is designed to probe all rounding directions, subnormal handling, and non-associative accumulation orders by varying input patterns and configurations that control Tensor Core execution. In the revision, we will extend §3 with a dedicated justification, drawing on NVIDIA architecture specifications and additional validation experiments across diverse GEMM shapes, to demonstrate coverage of behaviors in real models and fused kernels. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical characterization of GPU arithmetic is independent of fitted inputs

The paper presents Hawkeye as a framework built on a systematic sequence of empirical tests that directly measure rounding direction, subnormal handling, and non-associative accumulation order on NVIDIA Tensor Cores across specific architectures and precisions. No derivation chain, equations, or parameters are described that reduce by construction to the test inputs themselves; the reproduction claim rests on observed hardware behavior in the tested cases rather than any self-definitional mapping or renamed fit. The method is self-contained against external benchmarks because it involves direct execution and verification on the target hardware, with no load-bearing self-citations or ansatzes invoked to justify the core approach.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on empirical discovery of hardware-specific arithmetic rules rather than new mathematical axioms or postulated entities; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: NVIDIA Tensor Cores exhibit consistent but non-standard behaviors in rounding, subnormal handling, and accumulation order that can be reverse-engineered through targeted tests.
    The reproduction claim rests on the ability to fully characterize these behaviors via the described test sequence.

