
arXiv: 2512.07004 · v3 · submitted 2025-12-07 · 💻 cs.MS · cs.AR · cs.NA · math.NA

Accurate Models of NVIDIA Tensor Cores

Pith reviewed 2026-05-17 00:49 UTC · model grok-4.3

keywords: tensor cores · matrix multiplication · low-precision arithmetic · mixed-precision computing · GPU emulation · numerical reproducibility · floating-point models · hardware modeling

The pith

Software models emulate the inner-product behavior of NVIDIA Tensor Cores for 8-, 16-, and 19-bit formats across V100 to B200 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs software models that replicate the matrix-multiplication inner products performed by NVIDIA Tensor Cores on low- and mixed-precision data. These models target the specific numerical traits of the V100, A100, H100, and B200 GPUs in the input formats most relevant to mixed-precision developers. A reader would care because the hardware units follow rules that differ from IEEE 754, so the same code can produce different results on different generations and reproducibility becomes difficult without physical access to each platform. The models therefore let developers test and debug algorithms on simulated hardware that matches real outputs.
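To make the reproducibility concern concrete, here is a minimal Python sketch of why a departure from IEEE 754 step-by-step rounding changes results. It compares a dot product that rounds to binary32 after every operation with one that sums exactly and rounds once. The function names are illustrative, and the "round once" variant is an idealized stand-in; it is not claimed to match any specific GPU generation.

```python
import struct
from fractions import Fraction

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_rounded_each_step(a, b, c):
    """Round to binary32 after every multiply and every add, left to right."""
    acc = to_f32(c)
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_rounded_once(a, b, c):
    """Idealized fused variant: exact sum of all terms, rounded at the end.

    Note: float(exact) followed by to_f32 rounds twice in general; that is
    harmless here because the exact sum in the example is representable.
    """
    exact = Fraction(c) + sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    return to_f32(float(exact))

a = [1.0, 2.0 ** -24, 2.0 ** -24]
b = [1.0, 1.0, 1.0]
print(dot_rounded_each_step(a, b, 0.0))  # 1.0: each tiny term is rounded away
print(dot_rounded_once(a, b, 0.0))       # 1 + 2**-23: the two tiny terms combine
```

The same three inputs give two different binary32 answers depending only on where rounding happens, which is the kind of discrepancy the paper's models are meant to predict.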

Core claim

The central claim is that software models can emulate the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs for most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point. The models capture hardware-specific numerical features including rounding behaviour, accumulator width, normalization points, and extra carry bits that distinguish each GPU generation.
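As a rough illustration of what a parameterized inner-product model can look like, the sketch below aligns all exact products and the addend to the largest exponent, truncates each to a fixed number of fractional bits, and sums the result. The parameter `frac_bits`, the choice of truncation, and the function names are assumptions made for this example; the paper's per-GPU parameters and rounding rules are not reproduced here, and the final rounding to an output format is omitted.

```python
import math
from fractions import Fraction

def floor_log2(x):
    """Exact floor(log2(|x|)) for a nonzero Fraction."""
    n, d = abs(x).numerator, abs(x).denominator
    e = n.bit_length() - d.bit_length()
    if Fraction(n, d) < Fraction(2) ** e:
        e -= 1
    return e

def truncate(x, lsb_exp):
    """Truncate x toward zero to a multiple of 2**lsb_exp."""
    unit = Fraction(2) ** lsb_exp
    return math.trunc(x / unit) * unit

def block_dot(a, b, c, frac_bits):
    """Toy block inner product c + sum(a[i] * b[i]).

    frac_bits (hypothetical parameter) is the number of bits kept below the
    largest term's exponent before the terms are added.
    """
    terms = [Fraction(x) * Fraction(y) for x, y in zip(a, b)] + [Fraction(c)]
    nonzero = [t for t in terms if t != 0]
    if not nonzero:
        return 0.0
    e_max = max(floor_log2(t) for t in nonzero)
    lsb = e_max - frac_bits
    return float(sum(truncate(t, lsb) for t in nonzero))

a = [1.0, 2.0 ** -24, 2.0 ** -24]
b = [1.0, 1.0, 1.0]
print(block_dot(a, b, 0.0, frac_bits=23))  # 1.0: small terms fall below the kept bits
print(block_dot(a, b, 0.0, frac_bits=24))  # 1 + 2**-23: one extra bit keeps them
```

Changing a single width parameter changes the output for the same inputs, which is why accumulator width and extra bits have to be pinned down per generation.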

What carries the argument

Software models constructed from test vectors that distinguish rounding, accumulator width, normalization, and carry-bit behavior of each hardware generation.
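A simple probe of this kind can be sketched in Python. The `black_box_dot` function below is a hypothetical stand-in for a hardware call, with a hidden width chosen arbitrarily for the demonstration; the probe feeds it vectors of the form [1, 2^-k] and reports the largest k whose small term still affects the result. The paper's actual test vectors are not reproduced here.

```python
import math

HIDDEN_BITS = 25  # arbitrary value for the stand-in, not a measured hardware figure

def black_box_dot(a, b):
    """Hypothetical stand-in for a hardware inner product.

    Aligns products to the largest one and truncates everything below
    2**(e_max - HIDDEN_BITS) before summing.
    """
    prods = [x * y for x, y in zip(a, b)]
    nonzero = [p for p in prods if p != 0.0]
    if not nonzero:
        return 0.0
    e_max = max(math.frexp(p)[1] - 1 for p in nonzero)
    unit = 2.0 ** (e_max - HIDDEN_BITS)
    return sum(math.trunc(p / unit) for p in prods) * unit

def probe_kept_bits(dot, max_bits=60):
    """Largest k for which the term 2**-k survives next to 1.0."""
    kept = 0
    for k in range(1, max_bits + 1):
        if dot([1.0, 2.0 ** -k], [1.0, 1.0]) != 1.0:
            kept = k
    return kept

print(probe_kept_bits(black_box_dot))  # 25, recovering HIDDEN_BITS
```

A probe like this recovers one parameter in isolation; distinguishing truncation from rounding, or locating normalization points, would need further vectors designed for those features.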

If this is right

  • Developers can obtain predicted matrix-multiplication results for GPU generations they do not physically own.
  • Numerical reproducibility checks for mixed-precision algorithms become possible through simulation rather than repeated hardware runs.
  • Direct comparison of rounding and accumulator effects across V100, A100, H100, and B200 becomes straightforward.
  • Validated test vectors can be reused on future NVIDIA platforms with the expectation that the same distinctions will hold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same test-vector approach could be applied to matrix accelerators from other vendors to reduce cross-platform numerical surprises.
  • Embedding these emulators inside larger numerical libraries would let algorithm designers verify stability at scale without repeated hardware access.
  • Automated generation of distinguishing test vectors for new precision formats could speed up modeling of future GPU releases.

Load-bearing premise

The chosen test vectors are sufficient to distinguish the numerical features of each hardware generation and remain reliable when applied to new platforms.

What would settle it

Comparing the models' outputs against actual hardware results for a broad set of input vectors, both on the modeled GPUs (V100 through B200) and on generations released after B200. Systematic mismatches would show that the models do not fully capture the hardware behavior.

Figures

Figures reproduced from arXiv: 2512.07004 by Faizan A. Khattak, Mantas Mikaitis.

Figure 1: Number of machines on the November TOP500 lists that s…
Figure 2: A model of the inner product within the V100 GPU tensor…
Figure 3: A model of the inner product within the A100 GPU tensor…
Figure 4: A model of the inner product within the tensor cores of…
Figure 5: A model of the inner product within the tensor cores of…
Figure 6: An example MATLAB listing showing how to call the GEMM…
Figure 7: Multi-word arithmetic experiment presented by Mary…
Original abstract

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput over the software-based matrix multiplication, the multipliers are increasingly used outside of AI, to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures, at the matrix multiply-accumulate instruction level. To study numerical characteristics of matrix multipliers -- such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others -- test vectors are typically constructed. Yet, these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.
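For readers unfamiliar with the input formats, the Python sketch below rounds a value to the significand precision of several common low-precision formats. It assumes the 19-bit format is TF32 and that the 8- and 16-bit formats are the usual FP8 (E4M3, E5M2), binary16, and bfloat16 variants; the abstract does not list them, so this mapping is an assumption. Exponent range limits, subnormals, and overflow are deliberately ignored.

```python
import math

# Significand precision in bits, including the implicit leading bit.
PRECISION = {
    "fp8_e5m2": 3,
    "fp8_e4m3": 4,
    "bfloat16": 8,
    "binary16": 11,
    "tf32": 11,
}

def round_to_precision(x, p):
    """Round x to p significand bits with round-to-nearest, ties-to-even.

    Exponent range is not modeled, so this only shows precision loss.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * (1 << p)     # exact scaling by a power of two
    return math.ldexp(round(scaled), e - p)

for name, p in PRECISION.items():
    print(name, round_to_precision(0.1, p))
```

TF32 and binary16 share a significand width and differ in exponent range, so they give the same value in this sketch; a full model would also have to handle the range.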

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents software models for emulating the inner-product behavior of low- and mixed-precision matrix multipliers on NVIDIA Tensor Cores in the V100, A100, H100, and B200 GPUs. Models target the most relevant input formats for mixed-precision developers (8-, 16-, and 19-bit floating point) and are constructed by designing test vectors that probe hardware-specific numerical features such as rounding behavior, accumulator width, normalization points, and extra carry bits, then fitting the observed outputs to parameterized emulators.

Significance. If the models prove accurate across the full input space, the work would provide a practical, hardware-independent tool for studying and reproducing non-IEEE-compliant tensor-core arithmetic. This directly supports mixed-precision algorithm development in scientific computing, where cross-generation reproducibility is currently limited by hardware availability and undocumented micro-architectural details.

major comments (1)
  1. [§3] §3 (Test vector construction): The central accuracy claim requires that the chosen vectors exhaustively distinguish all relevant numerical features and their interactions (e.g., mixed-precision normalization with carry propagation). The manuscript describes vector design but supplies no quantitative coverage metric, error rate on held-out inputs, or explicit argument that untested combinations cannot produce divergent behavior; this directly affects whether the fitted models can be trusted for arbitrary inputs.
minor comments (2)
  1. [Abstract] The abstract states that models are supplied for 'most supported input formats' but does not list the exact formats and precisions covered in each GPU generation; a concise table would improve clarity.
  2. [§2] Notation for accumulator width and normalization point should be defined once in a dedicated subsection rather than introduced inline when first used.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and for the constructive comment on test vector construction. We address the concern directly below and will revise the manuscript to strengthen the presentation of coverage and validation.

point-by-point responses
  1. Referee: [§3] §3 (Test vector construction): The central accuracy claim requires that the chosen vectors exhaustively distinguish all relevant numerical features and their interactions (e.g., mixed-precision normalization with carry propagation). The manuscript describes vector design but supplies no quantitative coverage metric, error rate on held-out inputs, or explicit argument that untested combinations cannot produce divergent behavior; this directly affects whether the fitted models can be trusted for arbitrary inputs.

    Authors: We agree that a quantitative coverage argument would improve the manuscript. The test vectors were constructed to isolate and combine the key hardware-specific behaviors (rounding modes, accumulator width, normalization points, and extra carry bits) across the supported precisions, with explicit probes for mixed-precision interactions such as normalization during accumulation. In the revision we will add a dedicated subsection to §3 that (i) reports the total number of vectors and their breakdown by feature combination, (ii) presents error rates of the fitted emulators on a held-out set of 10^6 randomly generated inputs drawn from the same distributions but never used during model fitting, and (iii) supplies a concise argument that any untested combination would still be captured by the parameterized emulator because the probes were chosen to exercise every term in the model equations. These additions will make the coverage claim explicit and verifiable. revision: yes
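The held-out check described in the response could take roughly the following shape in Python. The harness draws random input vectors, runs a model and a reference on each, and reports the fraction of exact mismatches. Exact equality as the pass criterion, the uniform input distribution, and the function names are assumptions for illustration; the stand-in reference below is ordinary software arithmetic, not hardware output.

```python
import math
import random

def mismatch_rate(model, reference, n_trials, length, seed=0):
    """Fraction of random inputs on which model and reference disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(n_trials):
        a = [rng.uniform(-1.0, 1.0) for _ in range(length)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(length)]
        if model(a, b) != reference(a, b):
            mismatches += 1
    return mismatches / n_trials

def reference_dot(a, b):
    """Stand-in reference; a real study would call the GPU here."""
    return math.fsum(x * y for x, y in zip(a, b))

def biased_dot(a, b):
    """Deliberately wrong model, used to show the harness detects errors."""
    return reference_dot(a, b) + 1.0

print(mismatch_rate(reference_dot, reference_dot, n_trials=1000, length=4))  # 0.0
print(mismatch_rate(biased_dot, reference_dot, n_trials=1000, length=4))     # 1.0
```

Uniform random inputs rarely trigger edge cases such as ties or large cancellations, so a low mismatch rate from a harness like this complements targeted test vectors and does not replace them.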

Circularity Check

0 steps flagged

No circularity: empirical models fitted to external hardware measurements

full rationale

The paper constructs test vectors to probe hardware features such as rounding, accumulator width, normalization, and carry bits on V100/A100/H100/B200 GPUs, measures actual outputs, and fits software emulators to reproduce those observations for 8-/16-/19-bit formats. This is a direct empirical reverse-engineering process with no self-definitional loops, no fitted parameters renamed as independent predictions on the same data, and no load-bearing self-citations or imported uniqueness theorems. The central claim rests on external hardware benchmarks rather than internal equations that reduce to the fitting inputs by construction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a finite set of test vectors can fully characterize the undocumented numerical micro-architecture of each Tensor Core generation. No free parameters are explicitly named in the abstract, but any model that matches observed rounding or carry behavior necessarily contains fitted constants for accumulator width or extra bits.

axioms (1)
  • domain assumption: Hardware matrix-multiply behavior can be reverse-engineered from a modest number of carefully chosen test vectors.
    Invoked when the authors state that test vectors are used to study rounding, accumulator width, and carry bits.


