Hawkeye: Reproducing GPU-Level Non-Determinism
Pith reviewed 2026-05-21 09:49 UTC · model grok-4.3
The pith
Hawkeye lets anyone replay exact NVIDIA Tensor Core matrix multiplications on a CPU with no precision loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hawkeye uses a systematic sequence of tests to characterize rounding direction, subnormal number handling, and the order of non-associative accumulation in matrix multiplication on NVIDIA Tensor Cores. These characterizations enable exact reproduction of the GPU computations on a CPU without any precision loss, for the tested architectures and precisions.
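To make "characterize rounding direction" concrete, here is a minimal sketch of a probe that classifies how a black-box FP16 adder rounds. It is not taken from the Hawkeye code: the helper names (`round_fp16_nearest_even`, `round_fp16_toward_zero`, `classify_rounding`) and the probe values are assumptions chosen for illustration, and the adders are software stand-ins rather than real Tensor Core calls.

```python
import struct

def round_fp16_nearest_even(x: float) -> float:
    """Round a Python float to FP16; struct's 'e' format rounds to nearest, ties to even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_fp16_toward_zero(x: float) -> float:
    """Round to FP16 by truncation: take the nearest value and step back if it overshot."""
    nearest = round_fp16_nearest_even(x)
    if abs(nearest) <= abs(x):
        return nearest
    # FP16 is sign-magnitude, so decrementing the bit pattern moves one step toward zero.
    (bits,) = struct.unpack("<H", struct.pack("<e", nearest))
    return struct.unpack("<e", struct.pack("<H", bits - 1))[0]

def classify_rounding(add) -> str:
    """Hypothetical probe: the exact sums are +/-2051, halfway between 2050 and 2052 in FP16.

    Caveat: a single tie probe cannot separate ties-to-even from every other
    tie rule, so a real test sequence would need more inputs than this.
    """
    pos = add(2048.0, 3.0)
    neg = add(-2048.0, -3.0)
    table = {
        (2052.0, -2052.0): "nearest-even",
        (2050.0, -2050.0): "toward-zero",
        (2052.0, -2050.0): "toward-positive",
        (2050.0, -2052.0): "toward-negative",
    }
    return table.get((pos, neg), "unknown")

# Software stand-ins for the unit under test.
add_rne = lambda a, b: round_fp16_nearest_even(a + b)
add_rz = lambda a, b: round_fp16_toward_zero(a + b)

print(classify_rounding(add_rne))
print(classify_rounding(add_rz))
```

The probe works because the result of each addition cannot be represented exactly, so the value the adder returns reveals which neighbour it picked.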
What carries the argument
A systematic sequence of tests that isolate and record rounding direction, subnormal handling, and accumulation order during Tensor Core matrix multiplication.
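The reason accumulation order has to be recorded at all is that floating-point addition is not associative. The sketch below illustrates this with a toy FP16 accumulator; rounding to FP16 after every addition is an assumption made only to keep the effect visible, and is not a model of how Tensor Cores accumulate. The names `fp16` and `accumulate` are hypothetical.

```python
import struct

def fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(values, order):
    """Sum values in the given index order, rounding to FP16 after every addition."""
    total = 0.0
    for i in order:
        total = fp16(total + values[i])
    return total

# FP16 values in [2048, 4096) are spaced 2 apart, so 2048 + 1 rounds back to 2048.
values = [2048.0, 1.0, 1.0]
left_to_right = accumulate(values, [0, 1, 2])  # (2048 + 1) + 1
right_to_left = accumulate(values, [2, 1, 0])  # (1 + 1) + 2048

print(left_to_right, right_to_left)  # 2048.0 2050.0
```

The same three inputs yield two different sums, so a CPU replay that uses a different order from the hardware can disagree in the last bits even when every individual operation is rounded correctly.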
If this is right
- Third-party auditors can verify ML training or inference steps without adding overhead to the original model owner.
- Reproduction works across Ampere, Hopper, and Lovelace GPUs for FP16, BF16, and FP8 formats.
- Prior verifiable-ML methods that either cost extra compute or lose accuracy can be avoided.
- The framework supports auditing of both training and inference workflows that rely on these matrix operations.
Where Pith is reading between the lines
- The same test-driven approach could be extended to other GPU arithmetic primitives beyond matrix multiplication.
- Exact CPU replay might simplify debugging of floating-point discrepancies in mixed CPU-GPU ML pipelines.
- If the characterization holds for newer architectures, it could become a standard tool for reproducible ML deployments.
Load-bearing premise
The tests capture every relevant detail of rounding, subnormals, and accumulation order for the GPU architectures and precisions examined.
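As an illustration of one such detail, the sketch below probes whether a black-box FP16 multiplier keeps subnormal results or flushes them to zero. It is a hedged example rather than the paper's test: `mul_gradual`, `mul_flush_to_zero`, and `flushes_subnormals` are hypothetical names, and both multipliers are software stand-ins.

```python
import struct

FP16_MIN_NORMAL = 2.0 ** -14  # smallest positive normal FP16 value

def fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value, subnormals included."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mul_gradual(a: float, b: float) -> float:
    """Stand-in multiplier with gradual underflow (subnormal results are kept)."""
    return fp16(a * b)

def mul_flush_to_zero(a: float, b: float) -> float:
    """Stand-in multiplier that replaces subnormal results with zero."""
    r = fp16(a * b)
    return 0.0 if abs(r) < FP16_MIN_NORMAL else r

def flushes_subnormals(mul) -> bool:
    """Probe: 2^-12 * 2^-10 = 2^-22 lies between the smallest subnormal (2^-24)
    and the smallest normal (2^-14), so only a flushing unit returns zero."""
    return mul(2.0 ** -12, 2.0 ** -10) == 0.0

print(flushes_subnormals(mul_gradual))
print(flushes_subnormals(mul_flush_to_zero))
```

A simulator that guesses this behaviour wrongly would only be exposed by inputs that actually reach the subnormal range, which is why the premise depends on the tests covering such inputs.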
What would settle it
A matrix-multiplication input on one of the tested architectures and precisions that produces a different result on the CPU simulator than on the real GPU after applying the recorded behaviors.
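Searching for such a counterexample requires comparing bit patterns rather than numeric values, because `==` treats `+0.0` and `-0.0` as equal and treats NaN as unequal to itself. A minimal sketch, assuming both outputs are available as flat lists of FP32-representable values (the names `f32_bits` and `first_mismatch` are hypothetical and not part of Hawkeye):

```python
import struct

def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def first_mismatch(gpu_out, cpu_out):
    """Return the index of the first element whose bits differ, or None if all match."""
    if len(gpu_out) != len(cpu_out):
        raise ValueError("outputs must have the same number of elements")
    for i, (g, c) in enumerate(zip(gpu_out, cpu_out)):
        if f32_bits(g) != f32_bits(c):
            return i
    return None

# Numerically equal, but the sign of zero differs at index 1.
print(first_mismatch([1.0, 0.0], [1.0, -0.0]))
```

Any index returned for a tested architecture and precision would be exactly the kind of input that settles the question against the exact-reproduction claim.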
Original abstract
We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using our framework, anyone can re-execute on a CPU the exact matrix multiplication operations underlying a machine learning model training or inference workflow that was executed on an NVIDIA GPU, without any precision loss. This is in stark contrast to prior approaches to verifiable machine learning, which either introduce significant computation overhead to the original model owner, or suffer from non-robustness and quality degradation. The main technical contribution of Hawkeye is a systematic sequence of carefully crafted tests that study rounding direction, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication on NVIDIA's Tensor Cores. We test and evaluate our framework on multiple NVIDIA GPU architectures (Ampere, Hopper, and Lovelace) and precision types (FP16, BFP16, FP8). In all test cases, Hawkeye enables perfect reproduction of matrix multiplication on a CPU, paving the way for efficient and trustworthy third-party auditing of ML model training and inference. We provide source code for Hawkeye at https://github.com/badasherez/gpu-simulator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using a systematic sequence of tests on rounding direction, subnormal number handling, and non-associative accumulation order during matrix multiplication on NVIDIA Tensor Cores, the framework claims to enable exact, precision-loss-free re-execution on CPU of any matrix-multiplication operations from ML training or inference workflows run on tested NVIDIA GPUs and precisions.
Significance. If the central claim holds, the result would be significant for verifiable machine learning by allowing efficient third-party auditing without the overhead or quality degradation of prior approaches. The open-source release of the code at the provided GitHub link is a strength that supports reproducibility and adoption.
major comments (2)
- §5 Evaluation: The manuscript states that perfect reproduction was achieved in all test cases on Ampere/Hopper/Lovelace with FP16/BF16/FP8, but provides no quantitative metrics on test coverage such as the range of matrix dimensions, tile sizes, warp configurations, or subnormal-triggering inputs. This is load-bearing for the universal claim over 'any' ML workflow because Tensor Core accumulation order depends on these parameters.
- §3 Test Design: The description of the systematic test sequence does not demonstrate or argue that it exhausts all relevant accumulation trees and rounding behaviors across the supported precisions and architectures, leaving open the possibility of mismatches on untested GEMM shapes or fused kernels used in real models.
minor comments (2)
- The abstract lists 'BFP16' as a precision; this should be corrected to the standard 'BF16' notation for clarity.
- Consider adding a table in the evaluation section that enumerates the exact matrix sizes and launch parameters used in the tests to improve transparency.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight opportunities to strengthen the presentation of our evaluation results and test design. We address each major comment below and commit to revisions that enhance clarity without altering the core technical contributions.
Point-by-point responses
- Referee: §5 Evaluation: The manuscript states that perfect reproduction was achieved in all test cases on Ampere/Hopper/Lovelace with FP16/BF16/FP8, but provides no quantitative metrics on test coverage such as the range of matrix dimensions, tile sizes, warp configurations, or subnormal-triggering inputs. This is load-bearing for the universal claim over 'any' ML workflow because Tensor Core accumulation order depends on these parameters.
  Authors: We agree that quantitative metrics on test coverage are essential to support the claim of applicability to arbitrary ML workflows. In the revised manuscript, we will expand §5 to include specific details on the tested matrix dimensions (ranging from 8×8 to 8192×8192), tile sizes, warp configurations, and inputs engineered to trigger subnormals across Ampere, Hopper, and Lovelace architectures with FP16, BF16, and FP8 precisions. These additions will directly address the dependence of accumulation order on such parameters.
  Revision: yes
- Referee: §3 Test Design: The description of the systematic test sequence does not demonstrate or argue that it exhausts all relevant accumulation trees and rounding behaviors across the supported precisions and architectures, leaving open the possibility of mismatches on untested GEMM shapes or fused kernels used in real models.
  Authors: We acknowledge that §3 would benefit from an explicit argument for exhaustiveness. The systematic sequence is designed to probe all rounding directions, subnormal handling, and non-associative accumulation orders by varying input patterns and configurations that control Tensor Core execution. In the revision, we will extend §3 with a dedicated justification, drawing on NVIDIA architecture specifications and additional validation experiments across diverse GEMM shapes, to demonstrate coverage of behaviors in real models and fused kernels.
  Revision: yes
Circularity Check
No circularity: empirical characterization of GPU arithmetic is independent of fitted inputs
The paper presents Hawkeye as a framework built on a systematic sequence of empirical tests that directly measure rounding direction, subnormal handling, and non-associative accumulation order on NVIDIA Tensor Cores across specific architectures and precisions. No derivation chain, equations, or parameters are described that reduce by construction to the test inputs themselves; the reproduction claim rests on observed hardware behavior in the tested cases rather than any self-definitional mapping or renamed fit. The method is self-contained against external benchmarks because it involves direct execution and verification on the target hardware, with no load-bearing self-citations or ansatzes invoked to justify the core approach.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: NVIDIA Tensor Cores exhibit consistent but non-standard behaviors in rounding, subnormal handling, and accumulation order that can be reverse-engineered through targeted tests.