Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

Fan Yang; Mao Yang; Peichen Xie; Shuotao Xu; Yang Wang

arXiv: 2511.10909 · v2 · submitted 2025-11-14 · cs.AR · cs.LG · cs.NA · math.NA

Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3

classification cs.AR · cs.LG · cs.NA · math.NA

keywords GPU · Matrix Multiply-Accumulate · Tensor Cores · Numerical Accuracy · Bit-Accurate Modeling · Floating-Point Arithmetic · Numerical Discrepancy · AI Accelerators

The pith

Closed-loop probing yields the first bit-accurate arithmetic models for matrix multiply-accumulate units on ten GPU architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a closed-loop feature probing method to reverse-engineer the exact floating-point rules inside matrix multiply-accumulate hardware that vendors leave undocumented. These units drive modern AI accelerators yet produce inconsistent results across platforms and sometimes lower accuracy that destabilizes training. Applying the method to every relevant instruction on NVIDIA GPUs from Volta through Blackwell and AMD GPUs from CDNA1 through CDNA3 produces complete models that match real hardware outputs bit by bit. The models account for previously unexplained numerical differences, identify four precision bottlenecks plus one asymmetry in the designs, and supply both software fixes and suggestions for future hardware.

Core claim

A systematic closed-loop feature probing technique can fully characterize the internal arithmetic of undocumented matrix multiply-accumulate hardware, yielding precise models that match observed behavior on ten distinct GPU generations from NVIDIA and AMD.

What carries the argument

Closed-loop feature probing (CLFP), a framework that selects input matrices to expose every internal rounding mode, precision path, and accumulation behavior of MMA instructions.
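
To make the probing idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical black-box dot product (`make_black_box`) that keeps a known number of fractional bits and applies one of four candidate rounding modes; the function names, the bit width, and the candidate set are illustrative assumptions. Two probes, one positive and one negative, each landing three quarters of the way between two representable values, are enough to tell those four modes apart.

```python
from fractions import Fraction
import math

def quantize(x: Fraction, frac_bits: int, mode: str) -> Fraction:
    """Round x to a multiple of 2**-frac_bits using the given rounding mode."""
    scaled = x * (1 << frac_bits)
    if mode == "RD":        # toward negative infinity
        n = math.floor(scaled)
    elif mode == "RU":      # toward positive infinity
        n = math.ceil(scaled)
    elif mode == "RZ":      # toward zero
        n = math.trunc(scaled)
    elif mode == "RN":      # to nearest, ties to even
        n = round(scaled)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return Fraction(n, 1 << frac_bits)

def make_black_box(frac_bits: int, mode: str):
    """Hypothetical stand-in for hardware: exact dot product, then one rounding."""
    def dot(a, b):
        exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
        return quantize(exact, frac_bits, mode)
    return dot

def classify_rounding(dot, frac_bits: int) -> str:
    """Infer the rounding mode from two probes (assumes one of RD/RU/RZ/RN)."""
    ulp = Fraction(1, 1 << frac_bits)
    pos = dot([1, Fraction(3, 4) * ulp], [1, 1])     # exact value: +(1 + 0.75 ulp)
    neg = dot([-1, -Fraction(3, 4) * ulp], [1, 1])   # exact value: -(1 + 0.75 ulp)
    pos_grew = pos > 1     # magnitude rounded up on the positive probe
    neg_grew = neg < -1    # magnitude rounded up on the negative probe
    table = {
        (False, False): "RZ",
        (True, True): "RN",
        (False, True): "RD",
        (True, False): "RU",
    }
    return table[(pos_grew, neg_grew)]

if __name__ == "__main__":
    for mode in ("RD", "RU", "RZ", "RN"):
        print(mode, "->", classify_rounding(make_black_box(10, mode), 10))
```

A real probe would have to discover the bit width first and would need further inputs to separate modes that these two probes cannot distinguish, such as round-half-away-from-zero versus ties-to-even.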

If this is right

The models directly explain numerical discrepancies observed when running the same matrix operation on different GPU generations or vendors.
White-box error analysis becomes possible for neural network training and inference without relying on black-box measurements.
Four specific precision bottleneck designs and one numerical asymmetry are shown to limit accuracy in current MMAUs.
Concrete software workarounds exist for the identified accuracy issues, and the models supply design guidance for next-generation units.
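
As a rough illustration of how internal design choices can produce the cross-platform discrepancies listed above, the following Python sketch models a hypothetical fused accumulator. It aligns every term to the largest exponent, truncates each term to a fixed number of bits below that exponent, and then adds exactly. The function names and the `keep_bits` values are assumptions chosen for illustration; they are not the parameters reported in the paper.

```python
from fractions import Fraction
import math

def exponent(x: Fraction) -> int:
    """Return floor(log2(|x|)) exactly for a nonzero rational x."""
    x = abs(x)
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    return e

def fused_sum(terms, keep_bits: int) -> Fraction:
    """Hypothetical fused accumulation: align to the largest term, truncate, add."""
    terms = [Fraction(t) for t in terms]
    nonzero = [t for t in terms if t != 0]
    if not nonzero:
        return Fraction(0)
    emax = max(exponent(t) for t in nonzero)
    grid = Fraction(2) ** (emax - keep_bits)    # weight of the last kept bit
    # Truncate each aligned term toward zero, then sum the integers exactly.
    return sum(math.trunc(t / grid) for t in terms) * grid

if __name__ == "__main__":
    # The two large terms cancel exactly; only the small term should survive.
    terms = [2**12, Fraction(3, 2**13), -(2**12)]
    for bits in (13, 24, 40):
        print(bits, fused_sum(terms, bits))
```

With these inputs the three illustrative widths return three different sums, even though the mathematically exact result is the same in every case. This is the kind of effect a white-box model can attribute to a specific design parameter.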

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar probing techniques could be applied to other undocumented accelerator blocks such as tensor reductions or sparse matrix operations.
Simulator developers can embed these models to forecast training instability on future hardware without physical access.
The revealed bottlenecks suggest concrete changes vendors could make in future MMAU microarchitecture to raise effective precision.

Load-bearing premise

The selected probing inputs are enough to reveal every internal arithmetic behavior and rounding mode without missing undocumented edge cases or vendor optimizations.

What would settle it

An input matrix pair that produces a different bit-exact output on one of the ten tested architectures than the derived model predicts would show the model is incomplete.
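
A falsification test of this kind has to compare bit patterns rather than numeric values, because `==` treats `0.0` and `-0.0` as equal and never matches NaN. The sketch below shows one way to do that for FP32 outputs in Python. The helper names are hypothetical, and the two lists stand in for a model's output and a hardware readback.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def first_mismatch(model_out, hw_out):
    """Return (index, model_bits, hw_bits) for the first differing element, else None."""
    if len(model_out) != len(hw_out):
        raise ValueError("outputs must have the same length")
    for i, (m, h) in enumerate(zip(model_out, hw_out)):
        mb, hb = f32_bits(m), f32_bits(h)
        if mb != hb:
            return i, mb, hb
    return None

if __name__ == "__main__":
    # Numerically equal under ==, but the sign bit of the second element differs.
    print(first_mismatch([1.0, 0.0], [1.0, -0.0]))
```

Any non-`None` result on real hardware output would be a counterexample to the derived model for that instruction.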

Figures

Figures reproduced from arXiv: 2511.10909 by Fan Yang, Mao Yang, Peichen Xie, Shuotao Xu, Yang Wang.

**Figure 2.** Distributions of δRD, the numerical deviation of the CDNA3 FP16 MFMA instruction, which uses the rounding-down (RD) mode, and δRZ, the numerical deviation of a hypothetical FP16 MFMA instruction that uses the rounding-to-zero (RZ) mode.
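
The difference between the two modes in the caption can be seen in a toy setting. The Python sketch below is an illustration under simplifying assumptions, not a reproduction of the figure: it rounds a symmetric set of fixed-point values to integers and compares the mean deviation under rounding down (floor) with that under rounding to zero (truncation). The value range and grid spacing are arbitrary choices.

```python
from fractions import Fraction
import math

# Symmetric inputs: multiples of 1/16 in [-4, 4], so every x has a matching -x.
values = [Fraction(k, 16) for k in range(-64, 65)]

# Deviation = rounded value minus exact value.
dev_rd = [math.floor(v) - v for v in values]   # round down (toward -inf)
dev_rz = [math.trunc(v) - v for v in values]   # round toward zero

mean_rd = sum(dev_rd) / len(values)
mean_rz = sum(dev_rz) / len(values)

if __name__ == "__main__":
    print("mean deviation, RD:", float(mean_rd))
    print("mean deviation, RZ:", float(mean_rz))
```

Rounding down always moves a value toward negative infinity, so its deviations share one sign and do not cancel over symmetric inputs. Truncation toward zero is an odd function, so the deviation for `x` cancels the deviation for `-x`.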

Original abstract

Modern AI accelerators rely on matrix multiply-accumulate units (MMAUs), such as NVIDIA Tensor Cores and AMD Matrix Cores, to accelerate deep neural network workloads. MMAUs expose only instruction-level or API-level interfaces of matrix multiply-accumulate (MMA) operations, while leaving internal floating-point arithmetic behaviors undocumented. Consequently, MMAUs across vendors and architectural generations often produce numerical discrepancies for identical inputs, and sometimes exhibit reduced numerical accuracy that can cause training instability. Diagnosing and understanding the root causes of these effects is challenging without white-box models of their arithmetic behaviors. This paper proposes closed-loop feature probing (CLFP), a generic and systematic framework for constructing complete arithmetic behavior models of MMA operations. Based on this framework, we analyze all MMA instructions on ten GPU architectures spanning from NVIDIA Volta to RTX Blackwell and from AMD CDNA1 to CDNA3, and derive the first bit-accurate arithmetic models for these MMAUs. Our models explain previously observed cross-platform numerical discrepancies and accuracy issues, enable white-box numerical error analysis, reveal four precision bottleneck designs and one numerical asymmetry design that significantly affect numerical accuracy, and provide software workarounds as well as design guidance for future MMAUs. This work is open-source on https://github.com/microsoft/MMA-Sim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper provides the first bit-accurate models for MMA units on ten GPU architectures via systematic probing, which is practically useful despite the black-box limits.

This paper delivers bit-accurate models for the MMA units on ten GPU architectures using the authors' closed-loop feature probing method. They systematically test the instructions on hardware spanning Volta to Blackwell and CDNA1 to CDNA3. The resulting models match the observed numerical behavior and account for the discrepancies people have seen in practice. They also identify concrete precision bottlenecks and an asymmetry that affect accuracy in AI workloads. The open-source code lets others verify and apply these models directly. This kind of white-box view is rare because vendors do not publish the details.

The main soft spot is that the models rest on empirical probing of a black box. Although the authors validate on held-out inputs and the scope is wide, some undocumented edge cases may remain uncovered. This is a real but limited risk because the method is systematic and the claims are testable. The stress-test concern about completeness is fair, but the paper's validation steps address it reasonably well.

Practitioners who need to analyze or mitigate numerical errors in GPU matrix operations will find this useful, and hardware designers can use the identified issues for guidance. The paper shows clear engagement with the practical problem and provides reproducible artifacts. Readers interested in floating point on accelerators or in building more accurate simulators will get the most out of it. It deserves a serious referee. I would send it out for review.

Referee Report

1 major / 2 minor

Summary. The paper introduces Closed-Loop Feature Probing (CLFP), a systematic empirical framework for reverse-engineering undocumented internal arithmetic behaviors (precision, rounding, and edge cases) of GPU matrix multiply-accumulate units. The authors apply CLFP to every MMA instruction on ten architectures (NVIDIA Volta through Blackwell; AMD CDNA1 through CDNA3), derive what they claim are the first bit-accurate models, use them to explain observed cross-platform numerical discrepancies and accuracy loss, identify four precision-bottleneck designs plus one asymmetry, and supply software workarounds plus design guidance. The implementation is released as open source.

Significance. If the models hold, the work supplies the first white-box, bit-accurate characterizations of proprietary MMA hardware across two vendors and multiple generations. This enables precise numerical-error analysis for DNN training, explains previously mysterious discrepancies, and offers concrete guidance for future accelerator design. The open-source release and systematic coverage of all MMA instructions on ten chips are explicit strengths that support reproducibility and community use.

major comments (1)

[Section 3 (CLFP framework) and Section 5 (model derivation and validation)] The central claim that CLFP produces complete bit-accurate models for all MMA instructions rests on the assumption that the chosen probing inputs expose every internal precision, rounding mode, and vendor-specific behavior. The manuscript reports systematic probing plus validation on held-out inputs, yet provides no formal argument or exhaustive enumeration that every relevant corner case (subnormals, specific FMA saturations, undocumented modes) has been triggered. Because the hardware remains a black box, any missed case would render the derived model incomplete for the full input space; this directly affects the 'first bit-accurate' assertion.

minor comments (2)

[Table 2] Table 2 would benefit from an additional column showing the exact number of probing inputs used per architecture to allow readers to assess coverage density.
[Section 4] The notation for the extracted mantissa and exponent widths is introduced without a compact summary table; a single reference table would improve readability when comparing the ten architectures.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback, which has helped us improve the clarity and transparency of our manuscript. We address the major comment point by point below and have made revisions to strengthen the discussion of our methodology's limitations.

Referee: [Section 3 (CLFP framework) and Section 5 (model derivation and validation)] The central claim that CLFP produces complete bit-accurate models for all MMA instructions rests on the assumption that the chosen probing inputs expose every internal precision, rounding mode, and vendor-specific behavior. The manuscript reports systematic probing plus validation on held-out inputs, yet provides no formal argument or exhaustive enumeration that every relevant corner case (subnormals, specific FMA saturations, undocumented modes) has been triggered. Because the hardware remains a black box, any missed case would render the derived model incomplete for the full input space; this directly affects the 'first bit-accurate' assertion.

Authors: We appreciate the referee raising this critical aspect of our empirical approach. We agree that a formal mathematical argument for exhaustive coverage is not feasible given the black-box nature of the hardware. Our CLFP framework instead relies on systematic, iterative probing designed to trigger and isolate behaviors for precision, rounding, subnormals, FMA edge cases, and vendor-specific modes, followed by validation against held-out inputs and reproduction of documented numerical discrepancies. To address the concern directly, we have revised Section 5 to explicitly state that the models are the first to achieve bit-accurate fidelity across our comprehensive test suite (covering all MMA instructions on the ten architectures) while acknowledging the inherent limits of empirical reverse-engineering. We have also added a dedicated limitations paragraph discussing the possibility of untriggered corner cases and recommending ongoing community validation via the open-source release. This revision preserves the core contribution without overstating completeness.

revision: yes

Circularity Check

0 steps flagged

No circularity: models empirically constructed from external hardware measurements via CLFP probing

The derivation chain relies on applying the proposed CLFP framework to direct measurements of MMA instructions across ten GPU architectures. No equations, parameters, or claims reduce by construction to prior fitted values, self-definitions, or self-citation chains. The bit-accurate models are outputs of systematic probing and validation on held-out inputs, not inputs renamed as predictions. This is self-contained against external benchmarks (actual GPU behavior) with no load-bearing internal loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that exhaustive probing can recover undocumented internal floating-point rules; no new physical entities or ad-hoc constants are introduced beyond the discovered hardware behaviors.

axioms (1)

domain assumption: Floating-point arithmetic follows IEEE 754 rules except where vendor-specific deviations exist inside the MMA unit.
Invoked when interpreting probe results as bit-accurate models.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "we employ specially crafted inputs along with randomized inputs to detect the characteristics of the MMA... construct an arithmetic algorithm that models the arithmetic behavior"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing
cs.DC · 2026-03 · unverdicted · novelty 6.0

CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.
