arXiv: 2511.10909 · v2 · submitted 2025-11-14 · 💻 cs.AR · cs.LG · cs.NA · math.NA

Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy

Pith reviewed 2026-05-17 22:53 UTC · model grok-4.3

classification 💻 cs.AR · cs.LG · cs.NA · math.NA
keywords GPU · Matrix Multiply-Accumulate · Tensor Cores · Numerical Accuracy · Bit-Accurate Modeling · Floating-Point Arithmetic · Numerical Discrepancy · AI Accelerators

The pith

Closed-loop probing yields the first bit-accurate arithmetic models for matrix multiply-accumulate units on ten GPU architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a closed-loop feature probing method to reverse-engineer the exact floating-point rules inside matrix multiply-accumulate hardware that vendors leave undocumented. These units drive modern AI accelerators yet produce inconsistent results across platforms and sometimes lower accuracy that destabilizes training. Applying the method to every relevant instruction on NVIDIA GPUs from Volta through Blackwell and AMD GPUs from CDNA1 through CDNA3 produces complete models that match real hardware outputs bit by bit. The models account for previously unexplained numerical differences, identify four precision bottlenecks plus one asymmetry in the designs, and supply both software fixes and suggestions for future hardware.

Core claim

A systematic closed-loop feature probing technique can fully characterize the internal arithmetic of undocumented matrix multiply-accumulate hardware, yielding precise models that match observed behavior on ten distinct GPU generations from NVIDIA and AMD.

What carries the argument

Closed-loop feature probing (CLFP), a framework that selects input matrices to expose every internal rounding mode, precision path, and accumulation behavior of MMA instructions.
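
To make the idea of probing concrete, here is a minimal sketch of a hypothesize, probe, and prune loop that recovers a hidden rounding mode from a black box. It is an illustration only, not the paper's CLFP algorithm: the names `round_to_precision`, `make_black_box`, `probes`, and `identify_rounding_mode` are hypothetical, the "hardware" is a toy function that computes `a*b + c` exactly and rounds once, and exponent range is assumed unbounded.

```python
from fractions import Fraction
import math

MODES = ("RN", "RZ", "RD", "RU")

def round_to_precision(x: Fraction, p: int, mode: str) -> Fraction:
    """Round x to p significant bits (unbounded exponent) under `mode`."""
    if x == 0:
        return Fraction(0)
    mag = abs(x)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1                          # now 2**e <= |x| < 2**(e+1)
    ulp = Fraction(2) ** (e - p + 1)    # spacing of p-bit values near x
    q = x / ulp
    if mode == "RD":
        n = math.floor(q)               # toward -infinity
    elif mode == "RU":
        n = math.ceil(q)                # toward +infinity
    elif mode == "RZ":
        n = math.trunc(q)               # toward zero
    elif mode == "RN":
        n = round(q)                    # nearest, ties to even
    else:
        raise ValueError(mode)
    return n * ulp

def make_black_box(hidden_mode: str, p: int):
    """Toy stand-in for hardware (assumption): exact a*b + c, rounded once."""
    return lambda a, b, c: round_to_precision(a * b + c, p, hidden_mode)

def probes(p: int):
    """(a, b, c) triples whose exact results fall between neighbours of +/-1."""
    ulp = Fraction(2) ** (1 - p)        # spacing of p-bit values in [1, 2)
    return [
        (ulp / 4, Fraction(1), Fraction(1)),     # 1 + ulp/4
        (ulp / 4, Fraction(-1), Fraction(-1)),   # -(1 + ulp/4)
        (ulp / 4, Fraction(3), Fraction(1)),     # 1 + 3*ulp/4
    ]

def identify_rounding_mode(black_box, p: int):
    """Keep only the candidate modes that reproduce every observed output."""
    candidates = list(MODES)
    for a, b, c in probes(p):
        observed = black_box(a, b, c)
        candidates = [m for m in candidates
                      if round_to_precision(a * b + c, p, m) == observed]
    return candidates

if __name__ == "__main__":
    for mode in MODES:
        print(mode, identify_rounding_mode(make_black_box(mode, 11), 11))
```

The three probes are chosen so that each of the four candidate modes produces a different triple of outputs, which is why a single pass is enough in this toy setting. Real units have many more interacting features (internal widths, accumulation order, special values), so a realistic loop would need far more probes and hypotheses.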

If this is right

  • The models directly explain numerical discrepancies observed when running the same matrix operation on different GPU generations or vendors.
  • White-box error analysis becomes possible for neural network training and inference without relying on black-box measurements.
  • Four specific precision bottleneck designs and one numerical asymmetry are shown to limit accuracy in current MMAUs.
  • Concrete software workarounds exist for the identified accuracy issues, and the models supply design guidance for next-generation units.
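
One familiar way identical inputs can yield different outputs is accumulation order, which is the topic of the paper's Figure 1. The sketch below is a generic illustration, not a model of any specific GPU: it emulates binary32 accumulation in plain Python and sums the same four values sequentially and pairwise. The helper names `f32`, `sum_sequential`, and `sum_pairwise` are hypothetical.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_sequential(values):
    """Left-to-right accumulation, rounding to binary32 after every add."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

def sum_pairwise(values):
    """Tree-style accumulation: add neighbours, then add the partial sums."""
    vals = [f32(v) for v in values]
    while len(vals) > 1:
        nxt = [f32(vals[i] + vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])        # odd element carries over unchanged
        vals = nxt
    return vals[0]

# 2**24 + 1 is not representable in binary32, so the order of adds matters.
values = [2.0**24, 1.0, 1.0, -(2.0**24)]
seq = sum_sequential(values)
pair = sum_pairwise(values)

if __name__ == "__main__":
    print("sequential:", seq)
    print("pairwise:  ", pair)
    print("binary64:  ", sum(values))
```

Neither order recovers the exact sum here, and the two orders disagree with each other, which is the kind of effect a white-box model has to pin down before any error analysis can be trusted.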

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar probing techniques could be applied to other undocumented accelerator blocks such as tensor reductions or sparse matrix operations.
  • Simulator developers can embed these models to forecast training instability on future hardware without physical access.
  • The revealed bottlenecks suggest concrete changes vendors could make in future MMAU microarchitecture to raise effective precision.

Load-bearing premise

The selected probing inputs are enough to reveal every internal arithmetic behavior and rounding mode without missing undocumented edge cases or vendor optimizations.

What would settle it

An input matrix pair that produces a different bit-exact output on one of the ten tested architectures than the derived model predicts would show the model is incomplete.
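
A falsification test of this kind amounts to a differential check on raw bit patterns. The following sketch assumes two hypothetical callables, `model` and `hardware`, that each return a binary32 result; the toy stand-ins below exist only to make the block runnable and do not represent any real MMA instruction.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def find_mismatch(model, hardware, cases):
    """Return (args, model_bits, hardware_bits) for the first bit-level
    disagreement, or None if the two agree on every case."""
    for args in cases:
        m = f32_bits(model(*args))
        h = f32_bits(hardware(*args))
        if m != h:
            return args, m, h
    return None

# Hypothetical stand-ins: the "model" adds +0.0, the "hardware" passes through.
def toy_model(x: float) -> float:
    return x + 0.0

def toy_hardware(x: float) -> float:
    return x

cases = [(1.0,), (0.5,), (-0.0,)]
mismatch = find_mismatch(toy_model, toy_hardware, cases)

if __name__ == "__main__":
    if mismatch is None:
        print("no counterexample found")
    else:
        args, m, h = mismatch
        print(f"args={args} model=0x{m:08x} hardware=0x{h:08x}")
```

Comparing bit patterns rather than values is the stricter choice: in the toy case the two results compare equal as floats (`0.0 == -0.0`) yet differ in the sign bit, so a value-level comparison would have missed the discrepancy.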

Figures

Figures reproduced from arXiv: 2511.10909 by Fan Yang, Mao Yang, Peichen Xie, Shuotao Xu, Yang Wang.

Figure 1. Four typical summation orders on Tensor Cores and …

Figure 2. Distributions of δRD, the numerical deviation of the CDNA3 FP16 MFMA instruction that uses the rounding-down (RD) mode, and δRZ, the numerical deviation of a hypothetical FP16 MFMA instruction that uses the rounding-to-zero (RZ) mode.

…truncating the least significant bits in these encodings. Because floating-point numbers are based on the sign-magnitude encoding, we suggest using sign-magnitude arithmetic as well and…
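
The contrast between RD and RZ in the Figure 2 caption can be illustrated with integers. Dropping low bits of a two's-complement value rounds toward negative infinity (RD), while dropping low bits of the magnitude in a sign-magnitude value rounds toward zero (RZ). The sketch below is a toy on small integers under that assumption, not a reproduction of the paper's experiment; the function names are hypothetical.

```python
def truncate_twos_complement(n: int, k: int) -> int:
    """Drop the k low bits of a two's-complement integer.
    Python's >> is an arithmetic shift, so this rounds toward -infinity (RD)."""
    return (n >> k) << k

def truncate_sign_magnitude(n: int, k: int) -> int:
    """Drop the k low bits of the magnitude and keep the sign.
    This rounds toward zero (RZ)."""
    sign = -1 if n < 0 else 1
    return sign * ((abs(n) >> k) << k)

K = 3                                   # number of discarded bits
values = range(-64, 65)                 # inputs symmetric around zero

dev_rd = [truncate_twos_complement(v, K) - v for v in values]
dev_rz = [truncate_sign_magnitude(v, K) - v for v in values]

mean_rd = sum(dev_rd) / len(dev_rd)
mean_rz = sum(dev_rz) / len(dev_rz)

if __name__ == "__main__":
    print("mean deviation, RD:", mean_rd)
    print("mean deviation, RZ:", mean_rz)
```

On inputs symmetric around zero, every RD deviation is zero or negative, so the errors share a sign and accumulate as a bias. The RZ deviations mirror each other for positive and negative inputs and cancel on average.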
Abstract

Modern AI accelerators rely on matrix multiply-accumulate units (MMAUs), such as NVIDIA Tensor Cores and AMD Matrix Cores, to accelerate deep neural network workloads. MMAUs expose only instruction-level or API-level interfaces of matrix multiply-accumulate (MMA) operations, while leaving internal floating-point arithmetic behaviors undocumented. Consequently, MMAUs across vendors and architectural generations often produce numerical discrepancies for identical inputs, and sometimes exhibit reduced numerical accuracy that can cause training instability. Diagnosing and understanding the root causes of these effects is challenging without white-box models of their arithmetic behaviors. This paper proposes closed-loop feature probing (CLFP), a generic and systematic framework for constructing complete arithmetic behavior models of MMA operations. Based on this framework, we analyze all MMA instructions on ten GPU architectures spanning from NVIDIA Volta to RTX Blackwell and from AMD CDNA1 to CDNA3, and derive the first bit-accurate arithmetic models for these MMAUs. Our models explain previously observed cross-platform numerical discrepancies and accuracy issues, enable white-box numerical error analysis, reveal four precision bottleneck designs and one numerical asymmetry design that significantly affect numerical accuracy, and provide software workarounds as well as design guidance for future MMAUs. This work is open source at https://github.com/microsoft/MMA-Sim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Closed-Loop Feature Probing (CLFP), a systematic empirical framework for reverse-engineering undocumented internal arithmetic behaviors (precision, rounding, and edge cases) of GPU matrix multiply-accumulate units. The authors apply CLFP to every MMA instruction on ten architectures (NVIDIA Volta through Blackwell; AMD CDNA1 through CDNA3), derive the first claimed bit-accurate models, use them to explain observed cross-platform numerical discrepancies and accuracy loss, identify four precision-bottleneck designs plus one asymmetry, and supply software workarounds plus design guidance. The implementation is released as open source.

Significance. If the models hold, the work supplies the first white-box, bit-accurate characterizations of proprietary MMA hardware across two vendors and multiple generations. This enables precise numerical-error analysis for DNN training, explains previously mysterious discrepancies, and offers concrete guidance for future accelerator design. The open-source release and systematic coverage of all MMA instructions on ten chips are explicit strengths that support reproducibility and community use.

major comments (1)
  1. [Section 3 (CLFP framework) and Section 5 (model derivation and validation)] The central claim that CLFP produces complete bit-accurate models for all MMA instructions rests on the assumption that the chosen probing inputs expose every internal precision, rounding mode, and vendor-specific behavior. The manuscript reports systematic probing plus validation on held-out inputs, yet provides no formal argument or exhaustive enumeration that every relevant corner case (subnormals, specific FMA saturations, undocumented modes) has been triggered. Because the hardware remains a black box, any missed case would render the derived model incomplete for the full input space; this directly affects the 'first bit-accurate' assertion.
minor comments (2)
  1. [Table 2] Table 2 would benefit from an additional column showing the exact number of probing inputs used per architecture to allow readers to assess coverage density.
  2. [Section 4] The notation for the extracted mantissa and exponent widths is introduced without a compact summary table; a single reference table would improve readability when comparing the ten architectures.
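
The corner cases named in the major comment can at least be enumerated explicitly for a given input format. As a small sketch for FP16 (IEEE 754 binary16), the block below lists boundary bit patterns and forms all operand pairs from them. The names `CORNER_BITS`, `fp16_from_bits`, and `classify_fp16` are hypothetical, and the list is illustrative rather than a claim about what the paper's probe set contains.

```python
import itertools
import math
import struct

def fp16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def classify_fp16(bits: int) -> str:
    """Classify a binary16 pattern by its exponent and fraction fields."""
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0x1F:
        return "nan" if frac else "inf"
    if exp == 0:
        return "subnormal" if frac else "zero"
    return "normal"

# Boundary patterns of the binary16 format.
CORNER_BITS = {
    "+0": 0x0000,
    "-0": 0x8000,
    "min subnormal": 0x0001,
    "max subnormal": 0x03FF,
    "min normal": 0x0400,
    "max finite": 0x7BFF,
    "+inf": 0x7C00,
    "-inf": 0xFC00,
    "quiet NaN": 0x7E00,
}

# Every ordered pair of corner values, e.g. as operands of one multiply.
corner_pairs = list(itertools.product(CORNER_BITS.values(), repeat=2))

if __name__ == "__main__":
    for name, bits in CORNER_BITS.items():
        value = fp16_from_bits(bits)
        print(f"{name:14s} 0x{bits:04x} {classify_fp16(bits):9s} {value!r}")
    print(len(corner_pairs), "operand pairs")
```

An enumeration like this does not answer the referee's point about completeness, since undocumented behaviors may depend on combinations no list anticipates, but it makes the covered set auditable.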

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback, which has helped us improve the clarity and transparency of our manuscript. We address the major comment point by point below and have made revisions to strengthen the discussion of our methodology's limitations.

  1. Referee: [Section 3 (CLFP framework) and Section 5 (model derivation and validation)] The central claim that CLFP produces complete bit-accurate models for all MMA instructions rests on the assumption that the chosen probing inputs expose every internal precision, rounding mode, and vendor-specific behavior. The manuscript reports systematic probing plus validation on held-out inputs, yet provides no formal argument or exhaustive enumeration that every relevant corner case (subnormals, specific FMA saturations, undocumented modes) has been triggered. Because the hardware remains a black box, any missed case would render the derived model incomplete for the full input space; this directly affects the 'first bit-accurate' assertion.

    Authors: We appreciate the referee raising this critical aspect of our empirical approach. We agree that a formal mathematical argument for exhaustive coverage is not feasible given the black-box nature of the hardware. Our CLFP framework instead relies on systematic, iterative probing designed to trigger and isolate behaviors for precision, rounding, subnormals, FMA edge cases, and vendor-specific modes, followed by validation against held-out inputs and reproduction of documented numerical discrepancies. To address the concern directly, we have revised Section 5 to explicitly state that the models are the first to achieve bit-accurate fidelity across our comprehensive test suite (covering all MMA instructions on the ten architectures) while acknowledging the inherent limits of empirical reverse-engineering. We have also added a dedicated limitations paragraph discussing the possibility of untriggered corner cases and recommending ongoing community validation via the open-source release. This revision preserves the core contribution without overstating completeness.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: models empirically constructed from external hardware measurements via CLFP probing

The derivation chain relies on applying the proposed CLFP framework to direct measurements of MMA instructions across ten GPU architectures. No equations, parameters, or claims reduce by construction to prior fitted values, self-definitions, or self-citation chains. The bit-accurate models are outputs of systematic probing and validation on held-out inputs, not inputs renamed as predictions. This is self-contained against external benchmarks (actual GPU behavior) with no load-bearing internal loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that exhaustive probing can recover undocumented internal floating-point rules; no new physical entities or ad-hoc constants are introduced beyond the discovered hardware behaviors.

axioms (1)
  • domain assumption: Floating-point arithmetic follows IEEE 754 rules except where vendor-specific deviations exist inside the MMA unit.
    Invoked when interpreting probe results as bit-accurate models.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing

    cs.DC · 2026-03 · unverdicted · novelty 6.0

    CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.
