Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy
Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3
The pith
Closed-loop probing yields the first bit-accurate arithmetic models for matrix multiply-accumulate units on ten GPU architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A systematic closed-loop feature probing technique can fully characterize the internal arithmetic of undocumented matrix multiply-accumulate hardware, yielding precise models that match observed behavior on ten distinct GPU generations from NVIDIA and AMD.
What carries the argument
Closed-loop feature probing (CLFP), a framework that selects input matrices to expose every internal rounding mode, precision path, and accumulation behavior of MMA instructions.
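As a toy illustration of the probing idea (not the paper's CLFP implementation), the sketch below treats a software function as a black-box accumulator with a hidden significand width and recovers that width from crafted inputs alone. The names `truncate_to_bits`, `make_mock_accumulator`, and `probe_significand_bits` are hypothetical, and the truncating accumulator is an assumption chosen only to make the probe easy to follow.

```python
import math

def truncate_to_bits(x: float, bits: int) -> float:
    """Truncate x toward zero to `bits` significant binary digits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    scaled = math.trunc(m * (1 << bits))   # keep `bits` significant bits
    return math.ldexp(scaled, e - bits)

def make_mock_accumulator(hidden_bits: int):
    """Build a stand-in for an undocumented unit (assumption: it truncates)."""
    def accumulate(c: float, products: list) -> float:
        acc = c
        for p in products:
            acc = truncate_to_bits(acc + p, hidden_bits)
        return acc
    return accumulate

def probe_significand_bits(unit, max_bits: int = 40) -> int:
    """Add ever smaller powers of two to 1.0 until the addend is lost.

    If 1 + 2**-k survives but 1 + 2**-(k+1) does not, the unit keeps
    k + 1 significant bits (the leading 1 plus k fraction bits).
    """
    k = 0
    while k < max_bits and unit(1.0, [2.0 ** -(k + 1)]) != 1.0:
        k += 1
    return k + 1

if __name__ == "__main__":
    for width in (11, 24):
        unit = make_mock_accumulator(width)
        print(width, probe_significand_bits(unit))
```

The probe never inspects the accumulator's internals; it only observes outputs, which is the constraint the paper's framework also works under.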
If this is right
- The models directly explain numerical discrepancies observed when running the same matrix operation on different GPU generations or vendors.
- White-box error analysis becomes possible for neural network training and inference without relying on black-box measurements.
- Four specific precision bottleneck designs and one numerical asymmetry are shown to limit accuracy in current MMAUs.
- Concrete software workarounds exist for the identified accuracy issues, and the models supply design guidance for next-generation units.
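One generic source of such discrepancies is accumulation order. The sketch below is a minimal illustration, assuming binary32 products and sums with no fused operations; it does not reproduce the behavior of any specific GPU. The helper names `f32`, `dot_sequential`, and `dot_pairwise` are hypothetical.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_sequential(a, b):
    """Accumulate products left to right, rounding to binary32 each step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc

def dot_pairwise(a, b):
    """Accumulate products in a balanced tree, rounding to binary32 each step."""
    prods = [f32(x * y) for x, y in zip(a, b)]
    while len(prods) > 1:
        nxt = [f32(prods[i] + prods[i + 1])
               for i in range(0, len(prods) - 1, 2)]
        if len(prods) % 2:          # carry an unpaired element to the next level
            nxt.append(prods[-1])
        prods = nxt
    return prods[0]

a = [1e8, 1.0, -1e8, 1.0]
b = [1.0, 1.0, 1.0, 1.0]

if __name__ == "__main__":
    print(dot_sequential(a, b))   # 1.0
    print(dot_pairwise(a, b))     # 0.0
```

Both functions use the same inputs and the same rounding rule, yet they return 1.0 and 0.0: near 1e8 adjacent binary32 values are 8 apart, so where the small terms are absorbed depends entirely on the order of additions.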
Where Pith is reading between the lines
- Similar probing techniques could be applied to other undocumented accelerator blocks such as tensor reductions or sparse matrix operations.
- Simulator developers can embed these models to forecast training instability on future hardware without physical access.
- The revealed bottlenecks suggest concrete changes vendors could make in future MMAU microarchitecture to raise effective precision.
Load-bearing premise
The selected probing inputs are enough to reveal every internal arithmetic behavior and rounding mode without missing undocumented edge cases or vendor optimizations.
What would settle it
A single input matrix pair whose bit-exact output on any of the ten tested architectures differs from the derived model's prediction would show that the model is incomplete.
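A search for such a counterexample has to compare bit patterns rather than numeric values, because `==` treats +0.0 and -0.0 as equal. The sketch below is a minimal harness under stated assumptions: `model` and `hardware` are hypothetical stand-ins (plain Python functions, not the paper's models or a real GPU call), and all outputs are assumed to fit in binary32.

```python
import struct

def bits32(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def find_counterexample(model, hardware, inputs):
    """Return the first input whose outputs differ in any bit, else None."""
    for args in inputs:
        m, h = model(*args), hardware(*args)
        if bits32(m) != bits32(h):
            return args, bits32(m), bits32(h)
    return None

# Hypothetical model: a bare product.
def model(a, b):
    return a * b

# Hypothetical "hardware": the product is added to a +0.0 accumulator,
# which turns a -0.0 product into +0.0.
def hardware(a, b):
    return a * b + 0.0

inputs = [(1.0, 2.0), (-1.0, 0.0)]

if __name__ == "__main__":
    print(find_counterexample(model, hardware, inputs))
```

Here the second input exposes the mismatch: the model returns -0.0 (bit pattern 0x80000000) while the mock hardware returns +0.0 (0x00000000), a difference that a value comparison would miss.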
Original abstract
Modern AI accelerators rely on matrix multiply-accumulate units (MMAUs), such as NVIDIA Tensor Cores and AMD Matrix Cores, to accelerate deep neural network workloads. MMAUs expose only instruction-level or API-level interfaces of matrix multiply-accumulate (MMA) operations, while leaving internal floating-point arithmetic behaviors undocumented. Consequently, MMAUs across vendors and architectural generations often produce numerical discrepancies for identical inputs, and sometimes exhibit reduced numerical accuracy that can cause training instability. Diagnosing and understanding the root causes of these effects is challenging without white-box models of their arithmetic behaviors. This paper proposes closed-loop feature probing (CLFP), a generic and systematic framework for constructing complete arithmetic behavior models of MMA operations. Based on this framework, we analyze all MMA instructions on ten GPU architectures spanning from NVIDIA Volta to RTX Blackwell and from AMD CDNA1 to CDNA3, and derive the first bit-accurate arithmetic models for these MMAUs. Our models explain previously observed cross-platform numerical discrepancies and accuracy issues, enable white-box numerical error analysis, reveal four precision bottleneck designs and one numerical asymmetry design that significantly affect numerical accuracy, and provide software workarounds as well as design guidance for future MMAUs. This work is open-source on https://github.com/microsoft/MMA-Sim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Closed-Loop Feature Probing (CLFP), a systematic empirical framework for reverse-engineering undocumented internal arithmetic behaviors (precision, rounding, and edge cases) of GPU matrix multiply-accumulate units. The authors apply CLFP to every MMA instruction on ten architectures (NVIDIA Volta through Blackwell; AMD CDNA1 through CDNA3), derive what they claim are the first bit-accurate models, use them to explain observed cross-platform numerical discrepancies and accuracy loss, identify four precision-bottleneck designs plus one asymmetry, and supply software workarounds plus design guidance. The implementation is released as open source.
Significance. If the models hold, the work supplies the first white-box, bit-accurate characterizations of proprietary MMA hardware across two vendors and multiple generations. This enables precise numerical-error analysis for DNN training, explains previously mysterious discrepancies, and offers concrete guidance for future accelerator design. The open-source release and systematic coverage of all MMA instructions on ten chips are explicit strengths that support reproducibility and community use.
major comments (1)
- [Section 3 (CLFP framework) and Section 5 (model derivation and validation)] The central claim that CLFP produces complete bit-accurate models for all MMA instructions rests on the assumption that the chosen probing inputs expose every internal precision, rounding mode, and vendor-specific behavior. The manuscript reports systematic probing plus validation on held-out inputs, yet provides no formal argument or exhaustive enumeration that every relevant corner case (subnormals, specific FMA saturations, undocumented modes) has been triggered. Because the hardware remains a black box, any missed case would render the derived model incomplete for the full input space; this directly affects the 'first bit-accurate' assertion.
minor comments (2)
- [Table 2] Table 2 would benefit from an additional column showing the exact number of probing inputs used per architecture to allow readers to assess coverage density.
- [Section 4] The notation for the extracted mantissa and exponent widths is introduced without a compact summary table; a single reference table would improve readability when comparing the ten architectures.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback, which has helped us improve the clarity and transparency of our manuscript. We address the major comment point by point below and have made revisions to strengthen the discussion of our methodology's limitations.
Referee: [Section 3 (CLFP framework) and Section 5 (model derivation and validation)] The central claim that CLFP produces complete bit-accurate models for all MMA instructions rests on the assumption that the chosen probing inputs expose every internal precision, rounding mode, and vendor-specific behavior. The manuscript reports systematic probing plus validation on held-out inputs, yet provides no formal argument or exhaustive enumeration that every relevant corner case (subnormals, specific FMA saturations, undocumented modes) has been triggered. Because the hardware remains a black box, any missed case would render the derived model incomplete for the full input space; this directly affects the 'first bit-accurate' assertion.
Authors: We appreciate the referee raising this critical aspect of our empirical approach. We agree that a formal mathematical argument for exhaustive coverage is not feasible given the black-box nature of the hardware. Our CLFP framework instead relies on systematic, iterative probing designed to trigger and isolate behaviors for precision, rounding, subnormals, FMA edge cases, and vendor-specific modes, followed by validation against held-out inputs and reproduction of documented numerical discrepancies. To address the concern directly, we have revised Section 5 to explicitly state that the models are the first to achieve bit-accurate fidelity across our comprehensive test suite (covering all MMA instructions on the ten architectures) while acknowledging the inherent limits of empirical reverse-engineering. We have also added a dedicated limitations paragraph discussing the possibility of untriggered corner cases and recommending ongoing community validation via the open-source release. This revision preserves the core contribution without overstating completeness.
Revision: yes
Circularity Check
No circularity: models empirically constructed from external hardware measurements via CLFP probing
The derivation chain relies on applying the proposed CLFP framework to direct measurements of MMA instructions across ten GPU architectures. No equations, parameters, or claims reduce by construction to prior fitted values, self-definitions, or self-citation chains. The bit-accurate models are outputs of systematic probing and validation on held-out inputs, not inputs renamed as predictions. This is self-contained against external benchmarks (actual GPU behavior) with no load-bearing internal loops.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Floating-point arithmetic follows IEEE 754 rules except where vendor-specific deviations exist inside the MMA unit.
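One kind of deviation this assumption leaves room for is a rounding mode other than the IEEE 754 default, round-to-nearest with ties to even. The sketch below is a toy probe under stated assumptions, not the paper's procedure: `quantize`, `make_add`, and `classify_rounding` are hypothetical names, and the mock unit is assumed to use one of only three modes.

```python
import math

def quantize(x: float, bits: int, mode: str) -> float:
    """Round x to `bits` significant binary digits using `mode`."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * (1 << bits)        # exact: scaling by a power of two
    if mode == "nearest_even":
        q = round(scaled)           # Python's round() breaks ties to even
    elif mode == "toward_zero":
        q = math.trunc(scaled)
    elif mode == "toward_neg_inf":
        q = math.floor(scaled)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return math.ldexp(q, e - bits)

def make_add(bits: int, mode: str):
    """Mock adder with a hidden rounding mode (illustrative stand-in)."""
    return lambda a, b: quantize(a + b, bits, mode)

def classify_rounding(add, bits: int) -> str:
    """Infer the rounding mode from three crafted additions near 1.0."""
    ulp = 2.0 ** -(bits - 1)        # spacing of representable values at 1.0
    if add(1.0, 0.75 * ulp) > 1.0:
        # Rounded up past 1.0: a nearest-style mode. The exact tie at
        # 1 + ulp/2 then separates ties-to-even from other tie rules.
        return "nearest_even" if add(1.0, 0.5 * ulp) == 1.0 else "other"
    if add(-1.0, -0.75 * ulp) < -1.0:
        return "toward_neg_inf"     # magnitude grew for a negative sum
    return "toward_zero"

if __name__ == "__main__":
    for mode in ("nearest_even", "toward_zero", "toward_neg_inf"):
        print(mode, classify_rounding(make_add(24, mode), 24))
```

The probe assumes the significand width is already known, which is why width and rounding mode are naturally probed in sequence.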
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we employ specially crafted inputs along with randomized inputs to detect the characteristics of the MMA... construct an arithmetic algorithm that models the arithmetic behavior"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing: CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.