SGEMM-cube: Precision-Recovery FP32 GEMM Approximation on Ascend NPUs with FP16 Matrix Engines

Baisong Xu; Dengdeng Fan; Kai Yang; Pengxiang Xu; Weicheng Xue; Yonghong Tian; Yongxiang Liu

arxiv: 2507.23387 · v4 · submitted 2025-07-31 · 💻 cs.DC

Weicheng Xue, Baisong Xu, Kai Yang, Yongxiang Liu, Dengdeng Fan, Pengxiang Xu, Yonghong Tian

Pith reviewed 2026-05-19 02:54 UTC · model grok-4.3

keywords GEMM · FP32 approximation · FP16 matrix engines · Ascend NPUs · precision recovery · high-performance GEMM · matrix multiplication

The pith

SGEMM-cube approximates FP32 GEMM on Ascend NPUs' FP16 engines by splitting inputs into high and residual FP16 parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SGEMM-cube as a precision-recovery method for FP32 matrix multiplication on Ascend NPUs, whose fast matrix engines operate on FP16 inputs. Each FP32 operand is split into an FP16 high component and a scaled FP16 residual component. The matrix product is then reconstructed from the high-high and high-low terms while omitting the low-low term. This scheme is realized with architecture-specific adaptations, including L1-aware blocking and double-buffered pipelining on the NPU's software-managed memory. For inputs whose magnitudes fit inside the FP16 dynamic range, the approach delivers substantially higher accuracy than direct FP16 GEMM and approaches full FP32 accuracy, while reaching up to 65.3 TFLOP/s, or 77 percent of the FP32-equivalent peak performance.

Core claim

SGEMM-cube demonstrates that a two-component FP32-to-FP16 splitting strategy, reconstructing the product from high-high and high-low terms while omitting the low-low term, allows FP32-accuracy GEMM approximation on FP16-only Ascend NPU matrix engines. Analysis of round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend execution model clarifies the range and accuracy limitations, and standard high-performance GEMM techniques are adapted to the software-managed memory hierarchy, enabling up to 65.3 TFLOP/s or 77 percent of the FP32-equivalent peak defined by the three-GEMM decomposition cost.
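The 77 percent figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes an FP16 Cube peak of 256 TFLOP/s for the Ascend 910A, a commonly quoted figure that does not appear on this page, and divides it by three because each emulated FP32 product costs three FP16 GEMMs. The function name is hypothetical.

```python
def fp32_equivalent_peak(fp16_peak_tflops: float, gemms_per_product: int = 3) -> float:
    """FP32-equivalent peak when one FP32 GEMM costs `gemms_per_product` FP16 GEMMs."""
    return fp16_peak_tflops / gemms_per_product

# ASSUMPTION: 256 TFLOP/s FP16 peak; this number is not stated in the source text.
FP16_PEAK_TFLOPS = 256.0
MEASURED_TFLOPS = 65.3  # throughput reported on this page

peak = fp32_equivalent_peak(FP16_PEAK_TFLOPS)
efficiency = MEASURED_TFLOPS / peak
print(f"FP32-equivalent peak: {peak:.1f} TFLOP/s, efficiency: {efficiency:.1%}")
```

Under that assumed peak the ratio comes out near 76.5 percent, which is consistent with the rounded 77 percent quoted above.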

What carries the argument

The two-component FP32-to-FP16 splitting strategy, in which each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, with the matrix product reconstructed from the dominant high-high and high-low terms.
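A minimal NumPy sketch of this splitting idea follows. It is an illustration under stated assumptions, not the paper's kernel: the FP16 matrix engine is emulated by casting FP16 inputs to FP32 and multiplying (that is, FP32 accumulation is assumed), the residual is scaled by a power of two `2**SB`, and all function names are hypothetical. The value 12 echoes the "sb = 12" passage quoted further down this page.

```python
import numpy as np

SB = 12  # residual scaling exponent (assumed to act as a factor 2**SB)

def split_fp32(x: np.ndarray, sb: int = SB):
    """Split FP32 values into an FP16 high part and a scaled FP16 residual."""
    x = x.astype(np.float32)
    hi = x.astype(np.float16)                       # round-to-nearest conversion
    residual = x - hi.astype(np.float32)            # what the high part missed
    lo = (residual * np.float32(2.0 ** sb)).astype(np.float16)
    return hi, lo

def fp16_gemm(a16: np.ndarray, b16: np.ndarray) -> np.ndarray:
    """Stand-in for an FP16 matrix engine: FP16 inputs, FP32 accumulation (assumed)."""
    return a16.astype(np.float32) @ b16.astype(np.float32)

def split_gemm(a: np.ndarray, b: np.ndarray, sb: int = SB) -> np.ndarray:
    """Three-GEMM reconstruction: high-high plus the two high-low cross terms."""
    a_hi, a_lo = split_fp32(a, sb)
    b_hi, b_lo = split_fp32(b, sb)
    high_high = fp16_gemm(a_hi, b_hi)
    cross = fp16_gemm(a_hi, b_lo) + fp16_gemm(a_lo, b_hi)  # low-low term omitted
    return high_high + cross * np.float32(2.0 ** -sb)

def rel_error(c: np.ndarray, ref: np.ndarray) -> float:
    """Frobenius-norm relative error against a reference product."""
    return float(np.linalg.norm(c - ref) / np.linalg.norm(ref))

rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=(64, 128)).astype(np.float32)
b = rng.uniform(-1.0, 1.0, size=(128, 32)).astype(np.float32)
ref = a.astype(np.float64) @ b.astype(np.float64)   # float64 reference

err_fp16 = rel_error(fp16_gemm(a.astype(np.float16), b.astype(np.float16)), ref)
err_split = rel_error(split_gemm(a, b), ref)
print(f"plain FP16: {err_fp16:.2e}, split: {err_split:.2e}")
```

On inputs in [-1, 1] the split variant should land several orders of magnitude closer to the float64 reference than the plain FP16 product. Inputs near the top of the FP16 range would overflow the scaled residual in this sketch, which is one face of the range limitation discussed on this page.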

Load-bearing premise

All input magnitudes lie within the representable FP16 dynamic range, and the combined error from omitting the low-low term, round-to-nearest conversion, underflow, and accumulation order remains acceptable.
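Whether a given input satisfies the range half of this premise is easy to check before running anything on hardware. The sketch below counts how many FP32 entries would exceed the largest finite FP16 value or fall below the smallest normal FP16 value; `fp16_range_report` is a hypothetical helper, and the sample values are made up for illustration.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)    # largest finite FP16 value (65504)
FP16_TINY = float(np.finfo(np.float16).tiny)  # smallest normal FP16 value (2**-14)

def fp16_range_report(x: np.ndarray) -> dict:
    """Classify FP32 entries by where their magnitude falls relative to FP16."""
    mag = np.abs(x.astype(np.float32))
    return {
        # magnitude above the largest finite FP16 value
        "overflow": int(np.sum(mag > FP16_MAX)),
        # nonzero but below the smallest normal: subnormal or flushed to zero
        "below_normal": int(np.sum((mag > 0) & (mag < FP16_TINY))),
        # inside the normal FP16 range
        "normal": int(np.sum((mag >= FP16_TINY) & (mag <= FP16_MAX))),
    }

x = np.array([1.0e5, 3.0, 1.0e-3, 1.0e-6, 0.0], dtype=np.float32)
report = fp16_range_report(x)
print(report)
```

A nonzero `overflow` or `below_normal` count flags inputs for which the premise above does not hold as stated.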

What would settle it

Measuring relative error on inputs whose magnitudes exceed the FP16 range, or checking whether accuracy on moderate-range inputs fails to approach FP32 SGEMM levels when run on Ascend 910A hardware.

Figures

Figures reproduced from arXiv: 2507.23387 by Baisong Xu, Dengdeng Fan, Kai Yang, Pengxiang Xu, Weicheng Xue, Yonghong Tian, Yongxiang Liu.

**Figure 2.** Analysis of FP32 underflow/gradual underflow and precis…

**Figure 3.** DaVinci architecture of Huawei Ascend NPU AI core

**Figure 4.** Matrix blocking based on L1 cache reuse

Accompanying text: … held in L1 simultaneously:

$$N_{\text{fused}} = \operatorname{int}\!\left(\frac{\mathrm{L1} - 2 b_k b_n}{b_m b_k}\right) = \operatorname{int}\!\left(\frac{\mathrm{L1}}{b_m b_k} - \frac{2 b_n}{b_m}\right) = f\,\frac{\mathrm{L1}}{b_m b_k} \tag{8}$$

where $b_m$ and $b_k$ denote the block height and width of $A$, and $f$ ($0.92 \le f \le 1$ in our experiments) accounts for the correction from $b_n / b_m \sim O(1)$ and the floor operation. A larger $N_{\text{fused}}$ implies greater reuse of $A$ and fewer reloads of $B$ from main memory. The total memo…

**Figure 5.** Impact of blocking size on $N_{\text{fused}}$ and $f$

Accompanying text: … cores, and $C_{rw}$ accounts for reading and writing $C$ through the unified buffer $N_{\text{fused}}$ times. Block sizes are subject to hardware constraints arising from cube computation alignment and buffer capacities:

$$\begin{cases} b_m, b_k, b_n \equiv 0 \pmod{16} & \text{(cube alignment)} \\ b_m \times b_k \le 64 \times 256 & \text{(L0A capacity)} \\ b_k \times b_n \le 64 \times 256 & \text{(L0B capacity)} \\ b_m \times b_n \times 6 \le 248 \times 1024 & \text{(L0C and UB capacity…} \end{cases}$$

**Figure 6.** Comparison of single- and double-buffered pipelines based o…

**Figure 7.** Computation sequences of the SGEMM-cube precision rec…

**Figure 8.** Relative error vs. offset exponent under different input r…

**Figure 9.** Relative error vs. input matrix sizes. Varying m and n with fixed k = 64×44 (…

**Figure 10.** Performance impact of matrix blocking with L1 reuse

**Figure 11.** Performance vs. input matrix sizes

Accompanying text: 5. Conclusions. SGEMM-cube is presented as a high-performance algorithm for emulating FP32 GEMM on low-precision AI accelerators equipped only with FP16 compute units. By combining a decomposition-based FP32-to-FP16 mapping, tunable residual scaling, and term-wise accumulation within a cache-aware double-buffered pipeline, the method achieves both high numerical accuracy…
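The blocking relation quoted alongside Figure 4 (Eq. 8) and the block-size constraints quoted alongside Figure 5 can be turned into a small calculator. This is a sketch under assumptions: the L1 capacity is expressed in FP16 elements and set to 512 × 1024 (a 1 MiB buffer), a value not given on this page, and the function names are hypothetical. The constraint constants are copied from the quoted text.

```python
def n_fused(l1_elems: int, bm: int, bk: int, bn: int) -> int:
    """Eq. (8) as quoted: int((L1 - 2*bk*bn) / (bm*bk))."""
    return (l1_elems - 2 * bk * bn) // (bm * bk)

def correction_factor(l1_elems: int, bm: int, bk: int, bn: int) -> float:
    """The factor f defined by N_fused = f * L1 / (bm*bk)."""
    return n_fused(l1_elems, bm, bk, bn) * bm * bk / l1_elems

def block_is_feasible(bm: int, bk: int, bn: int) -> bool:
    """Check the quoted alignment and buffer-capacity constraints."""
    return (
        bm % 16 == 0 and bk % 16 == 0 and bn % 16 == 0  # cube alignment
        and bm * bk <= 64 * 256                         # L0A capacity
        and bk * bn <= 64 * 256                         # L0B capacity
        and bm * bn * 6 <= 248 * 1024                   # L0C and UB capacity
    )

# ASSUMPTION: 1 MiB of L1 counted in 2-byte FP16 elements; not from the source.
L1_ELEMS = 512 * 1024

block = (64, 256, 64)  # hypothetical (bm, bk, bn) chosen for illustration
print("feasible:", block_is_feasible(*block))
print("N_fused:", n_fused(L1_ELEMS, *block))
print("f:", correction_factor(L1_ELEMS, *block))
```

With these assumed numbers the example block is feasible and gives N_fused = 30 and f = 0.9375. Other capacities or block shapes will give different values, so the output illustrates the formula rather than reproducing the paper's measurements.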

Original abstract

Modern AI accelerators provide high-throughput low-precision matrix engines, but their support for FP32 GEMM is often limited or inefficient. This work presents SGEMM-cube, a precision-recovery FP32 GEMM approximation on Ascend NPUs using FP16 Cube units. Rather than claiming bit-exact FP32 approximation, SGEMM-cube targets near-FP32 accuracy for inputs whose magnitudes are representable within the FP16 dynamic range. The method follows a two-component FP32-to-FP16 splitting strategy related to Ozaki-style and Ootomo-style schemes: each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, and the matrix product is reconstructed from the dominant high-high and high-low terms while omitting the low-low term. The main contribution of this paper is not a new splitting paradigm, but an architecture-specific realization and analysis of this precision-recovery scheme on Ascend NPUs. We analyze the effects of round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend execution model, and clarify the range and accuracy limitations of the approach. We further adapt standard high-performance GEMM techniques, including L1-aware blocking and double-buffered pipelining, to the software-managed memory hierarchy of Ascend NPUs. Experiments on Ascend 910A show that SGEMM-cube recovers substantially higher accuracy than native FP16 GEMM and approaches FP32 SGEMM accuracy for moderate-range inputs, while achieving up to 65.3 TFLOP/s, corresponding to 77% of the FP32-equivalent peak defined by the three-GEMM decomposition cost. These results demonstrate that FP32-accuracy GEMM approximation can be made practical on FP16-only NPU matrix engines, provided that its range, error, and implementation constraints are explicitly managed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SGEMM-cube ports a two-part FP32 splitting scheme to Ascend NPUs, delivers up to 65.3 TFLOP/s (77% of the three-GEMM FP32-equivalent peak), and gets much closer to FP32 accuracy than native FP16 for inputs that fit in the FP16 range.

The core of this paper is a practical port of an existing splitting idea to Ascend hardware. They break each FP32 operand into a high FP16 part and a scaled residual, run the three main products on the fast FP16 cube units, drop the low-low term, and reassemble the result. The new pieces are the Ascend-specific error analysis around round-to-nearest, underflow, residual scaling, and accumulation order, plus the adapted L1 blocking and double-buffered pipelining for the software-managed memory hierarchy there.
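Why double buffering helps can be seen in an idealized timing model. The sketch below is not the paper's pipeline: it assumes every tile has the same load time and compute time, exactly two buffers, and no other contention. The function names and the time units are hypothetical.

```python
def single_buffered_time(n_tiles: int, t_load: float, t_compute: float) -> float:
    """Each tile is loaded and then computed, with no overlap."""
    return n_tiles * (t_load + t_compute)

def double_buffered_time(n_tiles: int, t_load: float, t_compute: float) -> float:
    """Two buffers: the load of tile i can overlap the compute of tile i-1."""
    load_done = []     # completion time of each tile's load
    compute_done = []  # completion time of each tile's compute
    for i in range(n_tiles):
        prev_load = load_done[i - 1] if i >= 1 else 0.0
        # A buffer is reusable once the tile loaded two steps earlier is computed.
        buffer_free = compute_done[i - 2] if i >= 2 else 0.0
        load_done.append(max(prev_load, buffer_free) + t_load)
        prev_compute = compute_done[i - 1] if i >= 1 else 0.0
        compute_done.append(max(prev_compute, load_done[i]) + t_compute)
    return compute_done[-1] if compute_done else 0.0

# Hypothetical numbers: 8 tiles, 2 time units to load, 3 to compute.
single = single_buffered_time(8, 2.0, 3.0)
double = double_buffered_time(8, 2.0, 3.0)
print(f"single-buffered: {single}, double-buffered: {double}")
```

In this model the double-buffered total approaches the slower of the two stages times the tile count, so the benefit is largest when load and compute times are comparable.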

Referee Report

0 major / 3 minor

Summary. The paper introduces SGEMM-cube, a precision-recovery approximation for FP32 GEMM on Ascend NPUs using FP16 Cube matrix engines. It uses a two-component splitting strategy where each FP32 operand is split into an FP16 high part and a scaled FP16 residual, reconstructs the product from high-high and high-low terms (omitting low-low), analyzes effects of round-to-nearest, underflow, residual scaling, and accumulation order, adapts L1-aware blocking and double-buffered pipelining for the NPU's software-managed memory, and reports experimental results on Ascend 910A achieving up to 65.3 TFLOP/s (77% of the FP32-equivalent three-GEMM peak) with substantially higher accuracy than native FP16 GEMM for moderate-range inputs.

Significance. If the reported results hold, this work is significant for showing a practical hardware-specific way to obtain near-FP32 accuracy GEMM on FP16-only matrix engines, addressing a limitation in many AI accelerators. It earns credit for the explicit analysis of round-to-nearest conversion, underflow, and accumulation order under the Ascend model, the adaptation of standard GEMM optimizations to the software-managed hierarchy, and the concrete, falsifiable throughput (65.3 TFLOP/s) and accuracy measurements against an externally defined three-GEMM peak. These elements provide useful benchmarks and implementation guidance for mixed-precision computing on NPUs.

minor comments (3)

[Abstract] The phrase 'approaches FP32 SGEMM accuracy' would be strengthened by a brief quantitative statement of the observed error reduction (e.g., orders of magnitude or relative error ranges) for the moderate-range regime.
[Analysis section] The residual scaling factor is listed as a free parameter; a short table or paragraph showing its chosen value and sensitivity to underflow on Ascend would improve reproducibility.
[Experiments] Input matrix generation for the moderate-range tests should be described more precisely (distribution, magnitude bounds) to allow independent verification of the accuracy claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive summary and positive evaluation of the significance of SGEMM-cube. The recommendation for minor revision is noted, and we appreciate the recognition of the architecture-specific analysis, error modeling, and concrete throughput/accuracy results. No specific major comments appear in the provided report, so we have no point-by-point rebuttals to offer at this stage. We remain ready to incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an architecture-specific implementation and empirical evaluation of a known two-component FP32-to-FP16 splitting approach (explicitly related to prior Ozaki-style and Ootomo-style schemes) on Ascend NPUs. It analyzes hardware effects such as round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend model, then reports measured accuracy and throughput (65.3 TFLOP/s reaching 77% of an externally defined three-GEMM peak). No equations, fitted parameters, or self-citations reduce the reported results to quantities defined inside the paper; the central claims rest on direct measurement against external benchmarks and acknowledged range limitations rather than internal derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard floating-point conversion and rounding semantics plus a chosen residual scaling factor; no new physical entities are postulated.

free parameters (1)

residual scaling factor
A scale is applied to the low-order residual so that it fits into FP16; the abstract does not specify whether this scale is fixed or tuned per matrix.
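The effect of the scale can be seen on individual values, without any matrix product. The sketch below assumes the scale is a power of two `2**sb` and compares `sb = 0` (no scaling) with `sb = 12`, the value mentioned in a passage quoted further down this page. The helper name and the input range are hypothetical. For inputs around 1e-3, the unscaled residual falls in the FP16 subnormal range, where the absolute spacing is fixed at 2**-24, so low-order bits are lost.

```python
import numpy as np

def reconstruct(x: np.ndarray, sb: int) -> np.ndarray:
    """Split x into FP16 high + scaled FP16 residual, then rebuild in float64."""
    x = x.astype(np.float32)
    hi = x.astype(np.float16)
    residual = x - hi.astype(np.float32)
    lo = (residual * np.float32(2.0 ** sb)).astype(np.float16)
    return hi.astype(np.float64) + lo.astype(np.float64) * 2.0 ** -sb

rng = np.random.default_rng(0)
x = rng.uniform(1e-3, 2e-3, size=10_000).astype(np.float32)

errors = {}
for sb in (0, 12):
    errors[sb] = float(np.max(np.abs(reconstruct(x, sb) - x.astype(np.float64))))
    print(f"sb = {sb:2d}: max abs reconstruction error = {errors[sb]:.2e}")
```

The scaled variant should show a clearly smaller maximum error on this input range. Whether a single fixed exponent suits all inputs is exactly the open question noted above.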

axioms (1)

standard math: Standard round-to-nearest-even behavior and dynamic-range limits of IEEE FP16 and FP32
Invoked when discussing conversion effects, underflow, and the representable range of inputs.
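Round-to-nearest-even can be observed directly with NumPy's `float16` type, used here as a stand-in for hardware conversion (an assumption: the Ascend conversion path is not exercised). FP16 has an 11-bit significand, so representable values are spaced 2 apart in [2048, 4096), and odd integers in that interval are exact ties.

```python
import numpy as np

# Ties between neighbouring FP16 values go to the one with an even significand.
for value in (2049.0, 2051.0):
    rounded = float(np.float16(value))
    print(f"{value} -> {rounded}")
```

Here 2049 rounds down to 2048 and 2051 rounds up to 2052, which is the tie-breaking rule the axiom refers to.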

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: The method decomposes each FP32 operand into two FP16 values... tunable scaling strategy... termwise accumulation scheme... cache-aware blocking and double-buffered pipeline... 77% of the FP32-equivalent theoretical peak

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Rules 1 and 2 collectively define the allowable bounds for the scaling exponent sb... sb = 12 is a reasonable and robust choice

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1] H. Yu, H. Li, H. Shi, T. S. Huang, G. Hua, Any-precision deep neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 10763–10771.

[2] B. Zhuang, L. Liu, M. Tan, C. Shen, I. Reid, Training quantized neural networks with a full-precision auxiliary module, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1488–1497.

[3] A. Vansteenkiste, J. Leliaert, M. Dvornik, M. Helsen, F. Garcia-Sanchez, B. Van Waeyenberge, The design and verification of mumax3, AIP advances 4 (10) (2014).

[4] G. P. Müller, M. Hoffmann, C. Dißelkamp, D. Schürhoff, S. Mavros, M. Sallermann, N. S. Kiselev, H. Jónsson, S. Blügel, Spirit: Multifunctional framework for atomistic spin simulations, Physical review b 99 (22) (2019) 224414.

[5] B. D. Wozniak, F. D. Witherden, F. P. Russell, P. E. Vincent, P. H. Kelly, Gimmik—generating bespoke matrix multiplication kernels for accelerators: Application to high-order computational fluid dynamics, Computer Physics Communications 202 (2016) 12–22.

[6] M. Cawkwell, E. Sanville, S. Mniszewski, A. M. Niklasson, Computing the density matrix in electronic structure theory on graphics processing units, Journal of chemical theory and computation 8 (11) (2012) 4094–4101.

[7] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed precision training, arXiv preprint arXiv:1710.03740 (2017).

[8] M. Rakka, M. E. Fouda, P. Khargonekar, F. Kurdahi, Mixed-precision neural networks: A survey, arXiv preprint arXiv:2208.06064 (2022).

[9] A. Haidar, S. Tomov, J. Dongarra, N. J. Higham, Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers, in: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2018, pp. 603–613.

[10] N. J. Higham, T. Mary, Mixed precision algorithms in numerical linear algebra, Acta Numerica 31 (2022) 347–414.

[11] H. Liao, J. Tu, J. Xia, H. Liu, X. Zhou, H. Yuan, Y. Hu, Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper, in: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), IEEE, 2021, pp. 789–801.

[12] K. Ozaki, T. Ogita, S. Oishi, S. M. Rump, Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications, Numerical Algorithms 59 (2012) 95–118.

[13] S. Markidis, S. W. Der Chien, E. Laure, I. B. Peng, J. S. Vetter, Nvidia tensor core programmability, performance & precision, in: 2018 IEEE international parallel and distributed processing symposium workshops (IPDPSW), IEEE, 2018, pp. 522–531.

[14] B. Feng, Y. Wang, G. Chen, W. Zhang, Y. Xie, Y. Ding, Egemm-tc: accelerating scientific computing on tensor cores with extended precision, in: Proceedings of the 26th ACM SIGPLAN symposium on principles and practice of parallel programming, 2021, pp. 278–291.

[15] H. Ootomo, R. Yokota, Recovering single precision accuracy from tensor cores while surpassing the fp32 theoretical peak performance, The International Journal of High Performance Computing Applications 36 (4) (2022) 475–491.

[16] M. Fasi, N. J. Higham, M. Mikaitis, S. Pranesh, Numerical behavior of nvidia tensor cores, PeerJ Computer Science 7 (2021) e330.

[17] Z. Ma, H. Wang, G. Feng, C. Zhang, L. Xie, J. He, S. Chen, J. Zhai, Efficiently emulating high-bitwidth computation with low-bitwidth hardware, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1–12.

[18] G. Li, J. Xue, L. Liu, X. Wang, X. Ma, X. Dong, J. Li, X. Feng, Unleashing the low-precision computation potential of tensor cores on gpus, in: 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE, 2021, pp. 90–102.

[19] W. Kahan, Ieee standard 754 for binary floating-point arithmetic, Lecture Notes on the Status of IEEE 754 (94720-1776) (1996) 11.
