Accurate Models of NVIDIA Tensor Cores
Pith reviewed 2026-05-17 00:49 UTC · model grok-4.3
The pith
Software models emulate the inner-product behavior of NVIDIA Tensor Cores for 8-, 16-, and 19-bit formats across V100 to B200 GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that software models can emulate the inner-product behavior of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs for most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point. The models capture hardware-specific numerical features, including rounding behavior, accumulator width, normalization points, and extra carry bits, that distinguish each GPU generation.
What carries the argument
Software models constructed from test vectors that distinguish rounding, accumulator width, normalization, and carry-bit behavior of each hardware generation.
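To make the idea of a parameterized emulator concrete, here is a minimal Python sketch of an inner product whose terms are aligned to the largest exponent and cut to a fixed number of fractional bits before summation. It is an illustration under stated assumptions, not the paper's models: the function names, the `frac_bits` and `round_to_nearest` parameters, and the choice to skip final normalization and output rounding are all hypothetical.

```python
from fractions import Fraction


def floor_log2(x):
    """Exact floor(log2(x)) for a positive Fraction."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:  # the estimate is at most one too large
        e -= 1
    return e


def emulate_dot(a, b, c, frac_bits, round_to_nearest=False):
    """Toy model of c + sum(a[i] * b[i]) with a limited-width accumulator.

    Hypothetical parameters (not taken from the paper):
      frac_bits        -- bits kept below the leading bit of the largest term
      round_to_nearest -- round aligned terms to nearest (ties to even)
                          instead of truncating toward zero
    """
    # Products of floating-point inputs are exact as rationals.
    terms = [Fraction(x) * Fraction(y) for x, y in zip(a, b)] + [Fraction(c)]
    nonzero = [abs(t) for t in terms if t != 0]
    if not nonzero:
        return Fraction(0)
    # Align every term to the exponent of the largest-magnitude term.
    e_max = floor_log2(max(nonzero))
    quantum = Fraction(2) ** (e_max - frac_bits)
    total = 0
    for t in terms:
        q = t / quantum
        # round() on a Fraction rounds ties to even; int() truncates toward zero.
        total += round(q) if round_to_nearest else int(q)
    # Final normalization and rounding to the output format are omitted here.
    return total * quantum


a = [1.0, 2.0**-12, 2.0**-12]
print(emulate_dot(a, a, 0.0, frac_bits=23))  # small products are truncated away
print(emulate_dot(a, a, 0.0, frac_bits=24))  # one extra bit keeps them
```

The two calls differ only in `frac_bits`, which is the kind of difference between hardware generations that the models are meant to capture. The specific widths used here are arbitrary and do not describe any particular GPU.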
If this is right
- Developers can obtain predicted matrix-multiplication results for GPU generations they do not physically own.
- Numerical reproducibility checks for mixed-precision algorithms become possible through simulation rather than repeated hardware runs.
- Direct comparison of rounding and accumulator effects across V100, A100, H100, and B200 becomes straightforward.
- Validated test vectors can be reused on future NVIDIA platforms with the expectation that the same distinctions will hold.
Where Pith is reading between the lines
- The same test-vector approach could be applied to matrix accelerators from other vendors to reduce cross-platform numerical surprises.
- Embedding these emulators inside larger numerical libraries would let algorithm designers verify stability at scale without repeated hardware access.
- Automated generation of distinguishing test vectors for new precision formats could speed up modeling of future GPU releases.
Load-bearing premise
The chosen test vectors are sufficient to distinguish the numerical features of each hardware generation and remain reliable when applied to new platforms.
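The premise can be checked mechanically for any fixed set of candidate models: a vector set is sufficient only if every pair of models disagrees on at least one vector. The sketch below illustrates this with three toy adders. The model names, bit widths, and test vectors are invented for illustration and do not correspond to any GPU or to the paper's vectors.

```python
from fractions import Fraction
from itertools import combinations


def make_model(frac_bits, rounding):
    """Toy adder: quantise each term to a multiple of 2**-frac_bits, then sum."""
    scale = 2 ** frac_bits

    def model(terms):
        total = 0
        for t in terms:
            q = Fraction(t) * scale
            # round() rounds ties to even; int() truncates toward zero.
            total += round(q) if rounding == "nearest" else int(q)
        return Fraction(total, scale)

    return model


# Hypothetical candidate models.
models = {
    "trunc-23": make_model(23, "truncate"),
    "trunc-24": make_model(24, "truncate"),
    "nearest-23": make_model(23, "nearest"),
}


def undistinguished_pairs(models, vectors):
    """Return the model pairs that agree on every vector in `vectors`."""
    pairs = []
    for (name_a, f_a), (name_b, f_b) in combinations(models.items(), 2):
        if all(f_a(v) == f_b(v) for v in vectors):
            pairs.append((name_a, name_b))
    return pairs


v1 = [Fraction(1), Fraction(1, 2**24), Fraction(1, 2**24)]
v2 = [Fraction(1), Fraction(3, 2**25)]

print(undistinguished_pairs(models, [v1]))      # [('trunc-23', 'nearest-23')]
print(undistinguished_pairs(models, [v1, v2]))  # []
```

With `v1` alone, the 23-bit truncating and 23-bit rounding models are indistinguishable, because the exact tie at half a unit rounds to even and so matches truncation. Adding `v2` separates them. A check of this kind only covers the models one thought to include, which is the gap the premise leaves open.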
What would settle it
Comparing the models' outputs against actual hardware results for a broad set of input vectors, on B200 and on later GPU generations as they become available; systematic mismatches would show the models do not fully capture the hardware behavior.
Original abstract
Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput over the software-based matrix multiplication, the multipliers are increasingly used outside of AI, to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures, at the matrix multiply-accumulate instruction level. To study numerical characteristics of matrix multipliers -- such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others -- test vectors are typically constructed. Yet, these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents software models for emulating the inner-product behavior of low- and mixed-precision matrix multipliers on NVIDIA Tensor Cores in the V100, A100, H100, and B200 GPUs. Models target the most relevant input formats for mixed-precision developers (8-, 16-, and 19-bit floating point) and are constructed by designing test vectors that probe hardware-specific numerical features such as rounding behavior, accumulator width, normalization points, and extra carry bits, then fitting the observed outputs to parameterized emulators.
Significance. If the models prove accurate across the full input space, the work would provide a practical, hardware-independent tool for studying and reproducing non-IEEE-compliant tensor-core arithmetic. This directly supports mixed-precision algorithm development in scientific computing, where cross-generation reproducibility is currently limited by hardware availability and undocumented micro-architectural details.
major comments (1)
- [§3] §3 (Test vector construction): The central accuracy claim requires that the chosen vectors exhaustively distinguish all relevant numerical features and their interactions (e.g., mixed-precision normalization with carry propagation). The manuscript describes vector design but supplies no quantitative coverage metric, error rate on held-out inputs, or explicit argument that untested combinations cannot produce divergent behavior; this directly affects whether the fitted models can be trusted for arbitrary inputs.
minor comments (2)
- [Abstract] The abstract states that models are supplied for 'most supported input formats' but does not list the exact formats and precisions covered in each GPU generation; a concise table would improve clarity.
- [§2] Notation for accumulator width and normalization point should be defined once in a dedicated subsection rather than introduced inline when first used.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of the work's significance and for the constructive comment on test vector construction. We address the concern directly below and will revise the manuscript to strengthen the presentation of coverage and validation.
Point-by-point responses
- Referee: [§3] §3 (Test vector construction): The central accuracy claim requires that the chosen vectors exhaustively distinguish all relevant numerical features and their interactions (e.g., mixed-precision normalization with carry propagation). The manuscript describes vector design but supplies no quantitative coverage metric, error rate on held-out inputs, or explicit argument that untested combinations cannot produce divergent behavior; this directly affects whether the fitted models can be trusted for arbitrary inputs.
  Authors: We agree that a quantitative coverage argument would improve the manuscript. The test vectors were constructed to isolate and combine the key hardware-specific behaviors (rounding modes, accumulator width, normalization points, and extra carry bits) across the supported precisions, with explicit probes for mixed-precision interactions such as normalization during accumulation. In the revision we will add a dedicated subsection to §3 that (i) reports the total number of vectors and their breakdown by feature combination, (ii) presents error rates of the fitted emulators on a held-out set of 10^6 randomly generated inputs drawn from the same distributions but never used during model fitting, and (iii) supplies a concise argument that any untested combination would still be captured by the parameterized emulator because the probes were chosen to exercise every term in the model equations. These additions will make the coverage claim explicit and verifiable.
  Revision: yes
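A held-out error rate of the kind promised in (ii) could be measured along the following lines. This is a hedged sketch, not the authors' procedure: `mismatch_rate`, the uniform input distribution, the vector length `k`, and the stand-in `reference_dot` (which takes the place of real hardware output) are all assumptions made for illustration.

```python
import math
import random
import struct


def f32_bits(x):
    """Bit pattern of x after rounding to binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def mismatch_rate(emulator, reference, n_trials, k=4, seed=0):
    """Fraction of random inputs on which the two callables differ bitwise.

    Comparing bit patterns rather than values also catches sign-of-zero
    differences, which an equality test on floats would miss.
    """
    rng = random.Random(seed)  # fixed seed keeps the held-out set reproducible
    mismatches = 0
    for _ in range(n_trials):
        a = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        if f32_bits(emulator(a, b)) != f32_bits(reference(a, b)):
            mismatches += 1
    return mismatches / n_trials


def reference_dot(a, b):
    """Stand-in for hardware output; a real study would read results from a GPU."""
    return math.fsum(x * y for x, y in zip(a, b))


def broken_emulator(a, b):
    """Deliberately wrong model, used only to show a nonzero rate."""
    return 0.0


print(mismatch_rate(reference_dot, reference_dot, n_trials=1000))
print(mismatch_rate(broken_emulator, reference_dot, n_trials=1000))
```

A rate of zero on random inputs is weaker evidence than it looks, since uniformly drawn values rarely hit the exact ties and carry patterns that targeted test vectors are built to exercise.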
Circularity Check
No circularity: empirical models fitted to external hardware measurements
The paper constructs test vectors to probe hardware features such as rounding, accumulator width, normalization, and carry bits on V100/A100/H100/B200 GPUs, measures actual outputs, and fits software emulators to reproduce those observations for 8-/16-/19-bit formats. This is a direct empirical reverse-engineering process with no self-definitional loops, no fitted parameters renamed as independent predictions on the same data, and no load-bearing self-citations or imported uniqueness theorems. The central claim rests on external hardware benchmarks rather than internal equations that reduce to the fitting inputs by construction, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Hardware matrix-multiply behavior can be reverse-engineered from a modest number of carefully chosen test vectors.