Accurate Models of NVIDIA Tensor Cores

Faizan A. Khattak; Mantas Mikaitis

arXiv: 2512.07004 · v3 · submitted 2025-12-07 · cs.MS · cs.AR · cs.NA · math.NA

Pith reviewed 2026-05-17 00:49 UTC · model grok-4.3

keywords: tensor cores · matrix multiplication · low-precision arithmetic · mixed-precision computing · GPU emulation · numerical reproducibility · floating-point models · hardware modeling

The pith

Software models emulate the inner-product behavior of NVIDIA Tensor Cores for 8-, 16-, and 19-bit formats across V100 to B200 GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs software models that replicate the matrix-multiplication inner products performed by NVIDIA Tensor Cores on low- and mixed-precision data. These models target the specific numerical traits of the V100, A100, H100, and B200 GPUs in the input formats most relevant to mixed-precision developers. A reader would care because the hardware units follow rules that differ from IEEE 754, so the same code can produce different results on different generations and reproducibility becomes difficult without physical access to each platform. The models therefore let developers test and debug algorithms on simulated hardware that matches real outputs.

Core claim

The central claim is that software models can emulate the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs for most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point. The models capture hardware-specific numerical features including rounding behaviour, accumulator width, normalization points, and extra carry bits that distinguish each GPU generation.
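
To make those features concrete, here is a minimal Python sketch of a toy fused inner product. Every addend is aligned to the largest exponent among the addends and truncated to a fixed number of fractional bits before one exact sum. The function name `toy_inner_product`, the parameter `acc_frac_bits`, and the truncation rule are illustrative assumptions, not the paper's models or measured hardware values, and the final rounding to an output format is left out.

```python
import math
from fractions import Fraction


def toy_inner_product(a, b, c, acc_frac_bits):
    """Toy model of c + sum(a[i] * b[i]) with a limited-width accumulator.

    Assumptions (illustrative only): products are exact, all addends are
    aligned to the largest addend exponent, and bits below `acc_frac_bits`
    fractional bits relative to that exponent are truncated toward zero.
    """
    terms = [Fraction(x) * Fraction(y) for x, y in zip(a, b)] + [Fraction(c)]
    nonzero = [t for t in terms if t != 0]
    if not nonzero:
        return 0.0
    # frexp returns (m, e) with 0.5 <= |m| < 1, so floor(log2|t|) is e - 1.
    e_max = max(math.frexp(float(t))[1] - 1 for t in nonzero)
    quantum = Fraction(2) ** (e_max - acc_frac_bits)
    total = Fraction(0)
    for t in terms:
        # int() truncates toward zero, dropping bits below the quantum.
        total += int(t / quantum) * quantum
    return float(total)


a = [1.0, 2.0 ** -25]
b = [1.0, 1.0]
narrow = toy_inner_product(a, b, 0.0, acc_frac_bits=24)
wide = toy_inner_product(a, b, 0.0, acc_frac_bits=26)
print(narrow == wide)  # False: the small product survives only in the wider accumulator
```

In this toy, changing one parameter changes the result for the same inputs, which is the kind of generation-to-generation difference the models are meant to capture.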

What carries the argument

Software models constructed from test vectors that distinguish rounding, accumulator width, normalization, and carry-bit behavior of each hardware generation.
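
A small Python sketch can show what it means for a test vector to distinguish two candidate behaviors. It compares two hypothetical rounding rules, truncation and round-to-nearest-even, applied to an exact inner product at an assumed number of fractional bits. The names `quantize`, `candidate_model`, and `distinguishes`, and the choice of 4 fractional bits, are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction


def quantize(value, frac_bits, mode):
    """Round an exact value to a multiple of 2**-frac_bits."""
    scaled = Fraction(value) * 2 ** frac_bits
    if mode == "truncate":
        q = int(scaled)    # toward zero
    elif mode == "nearest_even":
        q = round(scaled)  # Fraction rounds ties to even
    else:
        raise ValueError(f"unknown mode: {mode}")
    return Fraction(q, 2 ** frac_bits)


def candidate_model(a, b, frac_bits, mode):
    """Exact inner product followed by one rounding step (toy candidate)."""
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    return float(quantize(exact, frac_bits, mode))


def distinguishes(a, b, frac_bits, modes):
    """True if every candidate mode yields a different output for (a, b)."""
    outputs = {m: candidate_model(a, b, frac_bits, m) for m in modes}
    return len(set(outputs.values())) == len(modes), outputs


modes = ["truncate", "nearest_even"]
weak, _ = distinguishes([1.0, 2.0 ** -6], [1.0, 1.0], 4, modes)
strong, outputs = distinguishes([1.0, 3 * 2.0 ** -6], [1.0, 1.0], 4, modes)
print(weak)    # False: both candidates return 1.0
print(strong)  # True: the discarded part exceeds half a unit in the last place
```

The first vector is valid input but tells the two candidates apart no better than a coin toss would, while the second one separates them. The same check generalizes to any set of candidate models.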

If this is right

Developers can obtain predicted matrix-multiplication results for GPU generations they do not physically own.
Numerical reproducibility checks for mixed-precision algorithms become possible through simulation rather than repeated hardware runs.
Direct comparison of rounding and accumulator effects across V100, A100, H100, and B200 becomes straightforward.
Validated test vectors can be reused on future NVIDIA platforms with the expectation that the same distinctions will hold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same test-vector approach could be applied to matrix accelerators from other vendors to reduce cross-platform numerical surprises.
Embedding these emulators inside larger numerical libraries would let algorithm designers verify stability at scale without repeated hardware access.
Automated generation of distinguishing test vectors for new precision formats could speed up modeling of future GPU releases.

Load-bearing premise

The chosen test vectors are sufficient to distinguish the numerical features of each hardware generation and remain reliable when applied to new platforms.

What would settle it

Comparing the models' outputs against actual hardware results for a broad set of input vectors, on B200 and on later GPU generations, would settle it; systematic mismatches would show that the models do not fully capture the hardware behavior.
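
Such a comparison can be sketched as a bit-for-bit check of binary32 outputs. The Python below is a minimal illustration: `float32_bits` and `compare_outputs` are hypothetical helper names, and the two lists are placeholder values, not model or hardware results.

```python
import struct


def float32_bits(x):
    """Bit pattern of x after rounding to IEEE 754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def compare_outputs(model_outputs, hardware_outputs):
    """Return the indices at which the two output lists differ bit for bit."""
    if len(model_outputs) != len(hardware_outputs):
        raise ValueError("output lists must have equal length")
    return [
        i
        for i, (m, h) in enumerate(zip(model_outputs, hardware_outputs))
        if float32_bits(m) != float32_bits(h)
    ]


# Placeholder data for illustration only.
model_outputs = [1.0, 0.1, -0.0]
hardware_outputs = [1.0, 0.1 + 2.0 ** -20, 0.0]
mismatches = compare_outputs(model_outputs, hardware_outputs)
print(mismatches)                       # [1, 2]
print(len(mismatches) / len(model_outputs))
```

Comparing bit patterns rather than values treats `-0.0` and `0.0` as different, which matters when the sign of zero is itself one of the features under test.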

Figures

Figures reproduced from arXiv: 2512.07004 by Faizan A. Khattak, Mantas Mikaitis.

**Figure 2.** A model of the inner product within the V100 GPU tensor …

**Figure 3.** A model of the inner product within the A100 GPU tensor …

**Figure 4.** A model of the inner product within the tensor cores of …

**Figure 5.** A model of the inner product within the tensor cores of …

**Figure 6.** An example MATLAB listing showing how to call the GEMM …

**Figure 7.** Multi-word arithmetic experiment presented by Mary …

Original abstract

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput over the software-based matrix multiplication, the multipliers are increasingly used outside of AI, to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures, at the matrix multiply-accumulate instruction level. To study numerical characteristics of matrix multipliers -- such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others -- test vectors are typically constructed. Yet, these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies the first public models covering tensor core inner products on B200 and 19-bit formats alongside earlier NVIDIA generations, built from hardware measurements, though it needs clearer validation numbers to confirm coverage.

Colleague,

The one thing to know is that this paper introduces software models for the inner product operations in NVIDIA tensor cores across V100, A100, H100, and B200, specifically for the low and mixed precision formats that matter for scientific computing. What is new here is the inclusion of the B200 and the 19-bit floating point format alongside the earlier architectures in a unified set of models. Prior studies have looked at individual GPUs or formats, but this combines them based on direct hardware tests. That makes it easier to compare behaviors across generations without needing access to every machine.

The paper does well in describing how they build test vectors to probe the key numerical characteristics, such as rounding behavior, accumulator width, normalization points, and extra carry bits. This targeted testing helps distinguish the models for each hardware version, and since the models come from actual measurements rather than assumptions, they have a solid empirical foundation.

The soft spots are around the completeness of the validation. While the approach uses test vectors to fit the models, there is no mention of error rates, coverage percentages, or tests on inputs not used in construction. This means we have to take on faith that the chosen vectors caught all relevant cases, including potential complex interactions in mixed precision. If a behavior only shows up on certain untested patterns, the model could diverge from real hardware.

That said, the authors seem aware of the general issue with test vectors not always distinguishing features, which is a good sign they are not overclaiming. The stress-test worry about missed normalization or carry interactions is worth checking in the full text, but the construction method they describe does target those exact distinctions.

This kind of work is useful for anyone developing or porting mixed-precision algorithms who needs to predict or match the exact results from these tensor cores. A reader interested in numerical reproducibility or hardware-aware algorithm design would find it relevant.

I think it should go to peer review. The topic is important for practical computing, the method is appropriate, and the gaps are fixable with more details on testing.

Referee Report

1 major / 2 minor

Summary. The manuscript presents software models for emulating the inner-product behavior of low- and mixed-precision matrix multipliers on NVIDIA Tensor Cores in the V100, A100, H100, and B200 GPUs. Models target the most relevant input formats for mixed-precision developers (8-, 16-, and 19-bit floating point) and are constructed by designing test vectors that probe hardware-specific numerical features such as rounding behavior, accumulator width, normalization points, and extra carry bits, then fitting the observed outputs to parameterized emulators.

Significance. If the models prove accurate across the full input space, the work would provide a practical, hardware-independent tool for studying and reproducing non-IEEE-compliant tensor-core arithmetic. This directly supports mixed-precision algorithm development in scientific computing, where cross-generation reproducibility is currently limited by hardware availability and undocumented micro-architectural details.

major comments (1)

[§3] Test vector construction: The central accuracy claim requires that the chosen vectors exhaustively distinguish all relevant numerical features and their interactions (e.g., mixed-precision normalization with carry propagation). The manuscript describes vector design but supplies no quantitative coverage metric, error rate on held-out inputs, or explicit argument that untested combinations cannot produce divergent behavior; this directly affects whether the fitted models can be trusted for arbitrary inputs.

minor comments (2)

[Abstract] The abstract states that models are supplied for 'most supported input formats' but does not list the exact formats and precisions covered in each GPU generation; a concise table would improve clarity.
[§2] Notation for accumulator width and normalization point should be defined once in a dedicated subsection rather than introduced inline when first used.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and for the constructive comment on test vector construction. We address the concern directly below and will revise the manuscript to strengthen the presentation of coverage and validation.

Referee: [§3] Test vector construction: The central accuracy claim requires that the chosen vectors exhaustively distinguish all relevant numerical features and their interactions (e.g., mixed-precision normalization with carry propagation). The manuscript describes vector design but supplies no quantitative coverage metric, error rate on held-out inputs, or explicit argument that untested combinations cannot produce divergent behavior; this directly affects whether the fitted models can be trusted for arbitrary inputs.

Authors: We agree that a quantitative coverage argument would improve the manuscript. The test vectors were constructed to isolate and combine the key hardware-specific behaviors (rounding modes, accumulator width, normalization points, and extra carry bits) across the supported precisions, with explicit probes for mixed-precision interactions such as normalization during accumulation. In the revision we will add a dedicated subsection to §3 that (i) reports the total number of vectors and their breakdown by feature combination, (ii) presents error rates of the fitted emulators on a held-out set of 10^6 randomly generated inputs drawn from the same distributions but never used during model fitting, and (iii) supplies a concise argument that any untested combination would still be captured by the parameterized emulator because the probes were chosen to exercise every term in the model equations. These additions will make the coverage claim explicit and verifiable.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical models fitted to external hardware measurements

The paper constructs test vectors to probe hardware features such as rounding, accumulator width, normalization, and carry bits on V100/A100/H100/B200 GPUs, measures actual outputs, and fits software emulators to reproduce those observations for 8-/16-/19-bit formats. This is a direct empirical reverse-engineering process with no self-definitional loops, no fitted parameters renamed as independent predictions on the same data, and no load-bearing self-citations or imported uniqueness theorems. The central claim rests on external hardware benchmarks rather than internal equations that reduce to the fitting inputs by construction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a finite set of test vectors can fully characterize the undocumented numerical micro-architecture of each Tensor Core generation. No free parameters are explicitly named in the abstract, but any model that matches observed rounding or carry behavior necessarily contains fitted constants for accumulator width or extra bits.
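
The idea of a fitted constant can be illustrated with a minimal Python sketch that recovers an accumulator width from probe outputs by elimination. Everything here is an assumption made for illustration: `truncating_sum` is a toy stand-in for a hardware unit, the "measurements" are synthetic values generated by that toy with a hidden width of 5 bits, and `consistent_widths` is a hypothetical helper, not part of the paper's method.

```python
from fractions import Fraction


def truncating_sum(terms, frac_bits):
    """Exact sum truncated toward zero to a multiple of 2**-frac_bits (toy model)."""
    scaled = sum(Fraction(t) for t in terms) * 2 ** frac_bits
    return float(Fraction(int(scaled), 2 ** frac_bits))


def consistent_widths(observations, candidates):
    """Keep the candidate widths whose predictions match every observation."""
    return [
        w
        for w in candidates
        if all(truncating_sum(terms, w) == out for terms, out in observations)
    ]


# Probes of the form 1 + 2**-k; the small addend is lost once k exceeds the width.
probes = [[1.0, 2.0 ** -k] for k in range(1, 9)]
# Synthetic observations from the toy with a hidden width of 5 (not hardware data).
observations = [(p, truncating_sum(p, 5)) for p in probes]

print(consistent_widths(observations, range(1, 9)))       # [5]
print(consistent_widths([observations[2]], range(1, 9)))  # [3, 4, 5, 6, 7, 8]
```

The second call shows why the choice of vectors matters: a single probe leaves six widths consistent with the data, and only the full sweep pins the parameter down.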

axioms (1)

Domain assumption: Hardware matrix-multiply behavior can be reverse-engineered from a modest number of carefully chosen test vectors.
Invoked when the authors state that test vectors are used to study rounding, accumulator width, and carry bits.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1] A. Haidar, H. Bayraktar, S. Tomov, J. Dongarra, and N. J. Higham, “Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 476, no. 2243, p. 20200110, 2020.

[2] N. J. Higham and T. Mary, “Mixed precision algorithms in numerical linear algebra,” Acta Numerica, vol. 31, pp. 347–414, May 2022.

[3] J. Dongarra, J. Gunnels, H. Bayraktar, A. Haidar, and D. Ernst, “Accelerating supercomputing: AI-hardware-driven innovation for speed and efficiency,” in 2025 IEEE High Performance Extreme Computing Conference (HPEC), 2025, pp. 1–7.

[4] AMD, “Datasheet: AMD Instinct MI355X GPU,” 2025. [Online]. Available: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/product-briefs/amd-instinct-mi355x-gpu-brochure.pdf

[5] NVIDIA, “NVIDIA Blackwell architecture technical brief,”

[6] [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture

[7] P. Micikevicius, S. Oberman, P. Dubey, M. Cornea, A. Rodriguez, I. Bratt, R. Grisenthwaite, N. Jouppi, C. Chou, A. Huffman, M. Schulte, R. Wittig, D. Jani, and S. Deng, “OCP 8-bit floating point specification (OFP8),” Open Compute Project, Tech. Rep., Jun. 2023, revision 1.0. [Online]. Available: https://www.opencompute.org/documents/ocp-8-bit-float...

[8] “Interim report on binary floating-point formats for machine learning,” Tech. Rep., Nov. 2025, version 3.2. [Online]. Available: https://github.com/P3109/Public/blob/main/Shared%20Reports/IEEE%20WG%20P3109%20Interim%20Report%20v3.1.pdf

[9] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (revision of IEEE Std 754-2008). Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers, Jul. 2019.

[10] B. Hickmann and D. Bradford, “Experimental analysis of matrix multiplication functional units,” in 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), Oct. 2019, pp. 116–119.

[11] M. Fasi, N. J. Higham, M. Mikaitis, and S. Pranesh, “Numerical behavior of NVIDIA tensor cores,” PeerJ Computer Science, vol. 7, p. e330, 2021.

[12] X. Li, A. Li, B. Fang, K. Swirydowicz, I. Laguna, and G. Gopalakrishnan, “FTTN: Feature-targeted testing for numerical properties of NVIDIA & AMD matrix accelerators,” in 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2024, pp. 39–46.

[13] B. Valpey, X. Li, S. Pai, and G. Gopalakrishnan, “An SMT formalization of mixed-precision matrix multiplication,” in NASA Formal Methods. Cham: Springer Nature Switzerland, 2025, pp. 360–379.

[14] F. A. Khattak and M. Mikaitis, “Generalized methodology for determining numerical features of hardware floating-point matrix multipliers: Part I,” in 2025 IEEE High Performance Extreme Computing Conference (HPEC), Wakefield, MA, USA, Oct. 2025.

[15] K. E. Hillesland and A. Lastra, “GPU floating-point Paranoia,” in ACM Workshop on General-Purpose Computing on Graphics Processors (GP2). Los Angeles, CA, USA: ACM, Aug. 2004.

[16] X. Y. Tan, D. Boland, and G. Constantinides, “FPGA Paranoia: Testing numerical properties of FPGA floating point ip-cores,” in Reconfigurable Computing: Architectures, Tools and Applications, O. C. S. Choy, R. C. C. Cheung, P. Athanas, and K. Sano, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 290–301.

[17] M. Fasi and M. Mikaitis, “CPFloat: A C library for simulating low-precision arithmetic,” ACM Trans. Math. Softw., vol. 49, no. 2, pp. 18:1–18:32, Jun. 2023.

[18] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter, “NVIDIA tensor core programmability, performance & precision,” in Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, Vancouver, BC, Canada, Aug. 2018, pp. 522–531.

[19] L. Pisha and Ł. Ligowski, “Accelerating non-power-of-2 size Fourier transforms with GPU tensor cores,” in Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium, Portland, OR, USA, May 2021, pp. 507–516.

[20] H. Ootomo and R. Yokota, “Recovering single precision accuracy from tensor cores while surpassing the FP32 theoretical peak performance,” The International Journal of High Performance Computing Applications, vol. 36, no. 4, pp. 475–491, Jun. 2022.

[21] T. Mary and M. Mikaitis, “Error analysis of matrix multiplication with narrow range floating-point arithmetic,” SIAM J. Sci. Comput., vol. 47, no. 4, pp. B785–B800, 2025.

[22] M. Mikaitis, “Monotonicity of multi-term floating-point adders,” IEEE Trans. Comput., vol. 73, no. 6, pp. 1531–1543, Feb. 2024.

[23] H. Kaul, M. Anders, S. Mathew, S. Kim, and R. Krishnamurthy, “Optimized fused floating-point many-term dot-product hardware for machine learning accelerators,” in 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), 2019, pp. 84–87.

[24] B. Hickmann, J. Chen, M. Rotzin, A. Yang, M. Urbanski, and S. Avancha, “Intel Nervana Neural Network Processor-T (NNP-T) fused floating point many-term dot product,” in 2020 IEEE 27th Symposium on Computer Arithmetic (ARITH), 2020, pp. 133–136.

[25] P. Blanchard, N. J. Higham, F. Lopez, T. Mary, and S. Pranesh, “Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores,” SIAM Journal on Scientific Computing, vol. 42, no. 3, pp. C124–C141, 2020.

[26] A. F. Tenca, “Multi-operand floating-point addition,” in 2009 19th IEEE Symposium on Computer Arithmetic, 2009, pp. 161–168.

[27] N. L. Schryer, “A test of a computer’s floating-point arithmetic unit,” AT&T Bell Laboratories, Murray Hill, NJ 07974, Computer Science Technical Report 89, Feb. 1981.

[28] NVIDIA, “NVIDIA Tesla V100 GPU architecture,” 2017. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

[29] Intel Corporation, “BFLOAT16 – hardware numerics definition,” Available at https://software.intel.com/en-us/download/bfloat16-hardware-numerics-definition (accessed 15 July 2020), Nov. 2018, white paper. Document number 338302-001US.

[30] NVIDIA Corporation, “CUDA Binary Utilities, release 13.1,” 2025. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf

[31] A. Jarmusch, N. Graddon, and S. Chandrasekaran, “Dissecting the NVIDIA Blackwell architecture with microbenchmarks,” 2025. [Online]. Available: https://arxiv.org/abs/2507.10789
