
arxiv: 2507.23387 · v4 · submitted 2025-07-31 · 💻 cs.DC

SGEMM-cube: Precision-Recovery FP32 GEMM Approximation on Ascend NPUs with FP16 Matrix Engines

Pith reviewed 2026-05-19 02:54 UTC · model grok-4.3

classification 💻 cs.DC
keywords: GEMM · FP32 approximation · FP16 matrix engines · Ascend NPUs · precision recovery · high-performance GEMM · matrix multiplication

The pith

SGEMM-cube approximates FP32 GEMM on Ascend NPUs' FP16 engines by splitting inputs into high and residual FP16 parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SGEMM-cube, a precision-recovery method for FP32 matrix multiplication on Ascend NPUs, whose fast matrix engines operate only in FP16. Each FP32 operand is split into an FP16 high component and a scaled FP16 residual component. The matrix product is then reconstructed from the high-high and high-low terms; the low-low term is omitted. The scheme is realized with architecture-specific adaptations, including L1-aware blocking and double-buffered pipelining on the NPU's software-managed memory hierarchy. For inputs whose magnitudes fit within the FP16 dynamic range, the approach is substantially more accurate than direct FP16 GEMM and approaches FP32 SGEMM accuracy for moderate-range inputs, while reaching up to 65.3 TFLOP/s, or 77 percent of the FP32-equivalent peak.

Core claim

SGEMM-cube demonstrates that a two-component FP32-to-FP16 splitting strategy, which reconstructs the product from the high-high and high-low terms and omits the low-low term, enables near-FP32-accuracy GEMM approximation on FP16-only Ascend NPU matrix engines. An analysis of round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend execution model clarifies the range and accuracy limitations. Standard high-performance GEMM techniques are adapted to the software-managed memory hierarchy, yielding up to 65.3 TFLOP/s, or 77 percent of the FP32-equivalent peak defined by the three-GEMM decomposition cost.
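The 77 percent figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes an FP16 Cube peak of 256 TFLOP/s, a commonly quoted number for Ascend 910 that this page does not itself state, and divides it by three because each FP32-equivalent product costs three FP16 GEMMs.

```python
ACHIEVED_TFLOPS = 65.3      # throughput reported on this page
FP16_PEAK_TFLOPS = 256.0    # ASSUMPTION: commonly quoted Ascend 910 FP16 peak, not from this page
GEMMS_PER_PRODUCT = 3       # one high-high GEMM plus two high-low GEMMs

# FP32-equivalent peak: the FP16 peak shared across the three GEMMs.
fp32_equiv_peak = FP16_PEAK_TFLOPS / GEMMS_PER_PRODUCT
efficiency = ACHIEVED_TFLOPS / fp32_equiv_peak

print(f"FP32-equivalent peak: {fp32_equiv_peak:.1f} TFLOP/s")
print(f"efficiency: {efficiency:.1%}")
```

Under that assumed peak the ratio comes out close to the reported 77 percent, which suggests the page's number is measured against the FP16 peak divided by three.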

What carries the argument

The two-component FP32-to-FP16 splitting strategy, in which each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, with the matrix product reconstructed from the dominant high-high and high-low terms.
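A minimal pure-Python sketch of the split-and-reconstruct idea follows. It is not the paper's Ascend kernel: the scale `2**11`, the helper names, the matrix sizes, and the order in which the three partial products are combined are all assumptions made for illustration, and FP32 accumulation is emulated by rounding the accumulator after every add.

```python
import random
import struct

SCALE = 2.0 ** 11  # hypothetical residual scale; the page lists it as a free parameter

def fp16(x):
    """Round to the nearest FP16 value (raises OverflowError outside FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp32(x):
    """Round to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def split(M):
    """Split each FP32 entry into an FP16 high part and a scaled FP16 residual."""
    hi = [[fp16(x) for x in row] for row in M]
    lo = [[fp16((x - h) * SCALE) for x, h in zip(r, rh)] for r, rh in zip(M, hi)]
    return hi, lo

def gemm(A, B, rnd=float):
    """Naive GEMM; rnd rounds the accumulator after each add."""
    cols = list(zip(*B))
    C = []
    for row in A:
        out = []
        for col in cols:
            acc = 0.0
            for a, b in zip(row, col):
                acc = rnd(acc + a * b)
            out.append(acc)
        C.append(out)
    return C

def max_abs_err(C, ref):
    return max(abs(c - r) for rc, rr in zip(C, ref) for c, r in zip(rc, rr))

random.seed(0)
m, k, n = 8, 32, 8  # small illustrative sizes
A = [[fp32(random.uniform(-1, 1)) for _ in range(k)] for _ in range(m)]
B = [[fp32(random.uniform(-1, 1)) for _ in range(n)] for _ in range(k)]

ref = gemm(A, B)            # float64 reference
Ah, Al = split(A)
Bh, Bl = split(B)
hh = gemm(Ah, Bh, fp32)     # high-high term; also what FP16 inputs alone would give
hl = gemm(Ah, Bl, fp32)     # high-low term
lh = gemm(Al, Bh, fp32)     # low-high term; the low-low term is never computed
rec = [[fp32(hh[i][j] + (hl[i][j] + lh[i][j]) / SCALE) for j in range(n)]
       for i in range(m)]

err_fp16 = max_abs_err(hh, ref)
err_rec = max_abs_err(rec, ref)
print(f"FP16-input GEMM max abs error:  {err_fp16:.3e}")
print(f"three-GEMM recovery max error: {err_rec:.3e}")
```

The two cross terms are summed before being divided by the scale, so the small corrections are combined with each other before meeting the much larger high-high term. This is one plausible accumulation order, not necessarily the one the paper uses.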

Load-bearing premise

All input magnitudes lie within the representable FP16 dynamic range, and the combined error from omitting the low-low term, round-to-nearest conversion, underflow, and accumulation order remains acceptable.
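The range half of this premise is easy to probe. The sketch below uses Python's standard `struct` module to round values to IEEE FP16; the helper name `to_fp16` is hypothetical, and the limits shown are the standard IEEE binary16 ones rather than anything specific to Ascend hardware.

```python
import struct

FP16_MAX = 65504.0               # largest finite IEEE FP16 value
FP16_MIN_SUBNORMAL = 2.0 ** -24  # smallest positive IEEE FP16 value

def to_fp16(x):
    """Return x rounded to FP16, or None if x overflows the FP16 range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return None

# In-range values survive, large values overflow, tiny values flush to zero.
for x in (1.0, FP16_MAX, 1.0e5, FP16_MIN_SUBNORMAL, 1.0e-9):
    print(f"{x!r:>24} -> {to_fp16(x)!r}")
```

An FP32 input such as `1.0e5` has no FP16 high component at all, so the splitting scheme cannot start from it without some form of pre-scaling, which is outside what this page describes.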

What would settle it

Measuring relative error on inputs whose magnitudes exceed the FP16 range, or checking whether accuracy on moderate-range inputs fails to approach FP32 SGEMM levels on Ascend 910A hardware.

Figures

Figures reproduced from arXiv: 2507.23387 by Baisong Xu, Dengdeng Fan, Kai Yang, Pengxiang Xu, Weicheng Xue, Yonghong Tian, Yongxiang Liu.

Figure 1: Splitting a single FP32 floating number into two FP16 floating n…
Figure 2: Analysis of FP32 underflow/gradual underflow and precis…
Figure 3: DaVinci architecture of Huawei Ascend NPU AI core
Figure 4: Matrix blocking based on L1 cache reuse
(Adjacent text captured with the caption:) …held in L1 simultaneously:
$N_{\mathrm{fused}} = \mathrm{int}\left(\frac{L1 - 2 b_k b_n}{b_m b_k}\right) = \mathrm{int}\left(\frac{L1}{b_m b_k} - \frac{2 b_n}{b_m}\right) = f\,\frac{L1}{b_m b_k}$ (8)
where $b_m$ and $b_k$ denote the block height and width of A, and $f$ ($0.92 \le f \le 1$ in our experiments) accounts for the correction from $b_n / b_m \sim O(1)$ and the floor operation. A larger $N_{\mathrm{fused}}$ implies greater reuse of A and fewer reloads of B from main memory. The total memo…
Figure 5: Impact of blocking size on $N_{\mathrm{fused}}$ and $f$
(Adjacent text captured with the caption:) …cores, and $C_{rw}$ accounts for reading and writing C through the unified buffer $N_{\mathrm{fused}}$ times. Block sizes are subject to hardware constraints arising from cube computation alignment and buffer capacities:
  $b_m, b_k, b_n \equiv 0 \pmod{16}$ (cube alignment)
  $b_m \times b_k \le 64 \times 256$ (L0A capacity)
  $b_k \times b_n \le 64 \times 256$ (L0B capacity)
  $b_m \times b_n \times 6 \le 248 \times 1024$ (L0C and UB capacity…
Figure 6: Comparison of single- and double-buffered pipelines based o…
Figure 7: Computation sequences of the SGEMM-cube precision rec…
Figure 8: Relative error vs. offset exponent under different input r…
Figure 9: Relative error vs. input matrix sizes. Varying m and n with fixed k = 64×44 (…
Figure 10: Performance impact of matrix blocking with L1 reuse
Figure 11: Performance vs. input matrix sizes
(Adjacent text captured with the caption, from Section 5, Conclusions:) SGEMM-cube is presented as a high-performance algorithm for emulating FP32 GEMM on low-precision AI accelerators equipped only with FP16 compute units. By combining a decomposition-based FP32-to-FP16 mapping, tunable residual scaling, and term-wise accumulation within a cache-aware double-buffered pipeline, the method achieves both high numerical accuracy…
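The blocking formula (8) and the block-size constraints quoted in the Figure 4 and Figure 5 excerpts can be turned into a small calculator. This is a sketch under stated assumptions: the L1 capacity of 512 × 1024 elements and the block sizes are hypothetical illustration values, the function names are invented here, and the last constraint is used exactly as far as the truncated excerpt shows it.

```python
def n_fused(l1_capacity, bm, bk, bn):
    """Equation (8): int((L1 - 2*bk*bn) / (bm*bk))."""
    return (l1_capacity - 2 * bk * bn) // (bm * bk)

def correction_factor(l1_capacity, bm, bk, bn):
    """The factor f in Equation (8): N_fused = f * L1 / (bm*bk)."""
    return n_fused(l1_capacity, bm, bk, bn) * bm * bk / l1_capacity

def satisfies_constraints(bm, bk, bn):
    """Check the four block-size constraints visible in the excerpt."""
    return (
        bm % 16 == 0 and bk % 16 == 0 and bn % 16 == 0  # cube alignment
        and bm * bk <= 64 * 256                           # L0A capacity
        and bk * bn <= 64 * 256                           # L0B capacity
        and bm * bn * 6 <= 248 * 1024                     # L0C and UB capacity
    )

L1_CAPACITY = 512 * 1024   # HYPOTHETICAL capacity, counted in matrix elements
bm, bk, bn = 128, 64, 128  # HYPOTHETICAL block sizes

print("constraints ok:", satisfies_constraints(bm, bk, bn))
print("N_fused:", n_fused(L1_CAPACITY, bm, bk, bn))
print("f:", correction_factor(L1_CAPACITY, bm, bk, bn))
```

For these illustrative values `f` lands inside the 0.92 to 1 interval quoted in the excerpt, but the actual capacities and units on Ascend 910A should be taken from the paper rather than from this sketch.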
Original abstract

Modern AI accelerators provide high-throughput low-precision matrix engines, but their support for FP32 GEMM is often limited or inefficient. This work presents SGEMM-cube, a precision-recovery FP32 GEMM approximation on Ascend NPUs using FP16 Cube units. Rather than claiming bit-exact FP32 approximation, SGEMM-cube targets near-FP32 accuracy for inputs whose magnitudes are representable within the FP16 dynamic range. The method follows a two-component FP32-to-FP16 splitting strategy related to Ozaki-style and Ootomo-style schemes: each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, and the matrix product is reconstructed from the dominant high-high and high-low terms while omitting the low-low term. The main contribution of this paper is not a new splitting paradigm, but an architecture-specific realization and analysis of this precision-recovery scheme on Ascend NPUs. We analyze the effects of round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend execution model, and clarify the range and accuracy limitations of the approach. We further adapt standard high-performance GEMM techniques, including L1-aware blocking and double-buffered pipelining, to the software-managed memory hierarchy of Ascend NPUs. Experiments on Ascend 910A show that SGEMM-cube recovers substantially higher accuracy than native FP16 GEMM and approaches FP32 SGEMM accuracy for moderate-range inputs, while achieving up to 65.3 TFLOP/s, corresponding to 77% of the FP32-equivalent peak defined by the three-GEMM decomposition cost. These results demonstrate that FP32-accuracy GEMM approximation can be made practical on FP16-only NPU matrix engines, provided that its range, error, and implementation constraints are explicitly managed.
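The benefit of the double-buffered pipelining mentioned in the abstract can be illustrated with a toy timing model. This is a generic textbook model, not the paper's measured pipeline: the function names and the per-tile load and compute times are hypothetical, and it assumes that the load of the next tile fully overlaps with the compute of the current one.

```python
def single_buffered_time(loads, computes):
    """Each tile is loaded, then computed, with no overlap."""
    return sum(loads) + sum(computes)

def double_buffered_time(loads, computes):
    """Load of tile i+1 overlaps with compute of tile i (two buffers)."""
    total = loads[0]  # the first load cannot be hidden
    for i, compute in enumerate(computes):
        next_load = loads[i + 1] if i + 1 < len(loads) else 0
        total += max(compute, next_load)  # the slower stage sets the step time
    return total

# HYPOTHETICAL per-tile costs in arbitrary time units.
loads = [2, 2, 2, 2]
computes = [3, 3, 3, 3]

print("single-buffered:", single_buffered_time(loads, computes))
print("double-buffered:", double_buffered_time(loads, computes))
```

When compute dominates, as in this example, all loads after the first are hidden; when loads dominate, the pipeline becomes bound by data movement, which is the case that blocking for L1 reuse is meant to avoid.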

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SGEMM-cube, a precision-recovery approximation for FP32 GEMM on Ascend NPUs using FP16 Cube matrix engines. It uses a two-component splitting strategy where each FP32 operand is split into an FP16 high part and a scaled FP16 residual, reconstructs the product from high-high and high-low terms (omitting low-low), analyzes effects of round-to-nearest, underflow, residual scaling, and accumulation order, adapts L1-aware blocking and double-buffered pipelining for the NPU's software-managed memory, and reports experimental results on Ascend 910A achieving up to 65.3 TFLOP/s (77% of the FP32-equivalent three-GEMM peak) with substantially higher accuracy than native FP16 GEMM for moderate-range inputs.

Significance. If the reported results hold, this work is significant for showing a practical hardware-specific way to obtain near-FP32 accuracy GEMM on FP16-only matrix engines, addressing a limitation in many AI accelerators. It earns credit for the explicit analysis of round-to-nearest conversion, underflow, and accumulation order under the Ascend model, the adaptation of standard GEMM optimizations to the software-managed hierarchy, and the concrete, falsifiable throughput (65.3 TFLOP/s) and accuracy measurements against an externally defined three-GEMM peak. These elements provide useful benchmarks and implementation guidance for mixed-precision computing on NPUs.

minor comments (3)
  1. [Abstract] The phrase 'approaches FP32 SGEMM accuracy' would be strengthened by a brief quantitative statement of the observed error reduction (e.g., orders of magnitude or relative error ranges) for the moderate-range regime.
  2. [Analysis section] The residual scaling factor is listed as a free parameter; a short table or paragraph showing its chosen value and its sensitivity to underflow on Ascend would improve reproducibility.
  3. [Experiments] Input matrix generation for the moderate-range tests should be described more precisely (distribution, magnitude bounds) to allow independent verification of the accuracy claims.
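The second comment, on how sensitive the residual is to underflow, can be made concrete with a single hand-picked value. In the sketch below the helper `recovery_error` and the candidate scale exponents are hypothetical; the input is chosen so that its unscaled residual falls below the smallest FP16 subnormal and is lost entirely.

```python
import struct

def fp16(x):
    """Round to the nearest IEEE FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def recovery_error(x, scale_exp):
    """Absolute error of hi + lo/scale when the residual is scaled by 2**scale_exp."""
    scale = 2.0 ** scale_exp
    hi = fp16(x)
    lo = fp16((x - hi) * scale)
    return abs(x - (hi + lo / scale))

# An FP32-representable value just above the smallest normal FP16 number.
x = 2.0 ** -14 * (1 + 2.0 ** -11 + 2.0 ** -20)

for scale_exp in (0, 4, 8, 11):  # hypothetical candidate exponents
    print(f"scale 2^{scale_exp:<2}: error = {recovery_error(x, scale_exp):.3e}")
```

Without scaling, the residual rounds to zero and the split recovers nothing beyond the FP16 high part; with a scale of `2**11` this particular value is recovered exactly. A sensitivity table of the kind the referee asks for would repeat this over the input distribution used in the experiments.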

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive summary and positive evaluation of the significance of SGEMM-cube. The recommendation for minor revision is noted, and we appreciate the recognition of the architecture-specific analysis, error modeling, and concrete throughput/accuracy results. No specific major comments appear in the provided report, so we have no point-by-point rebuttals to offer at this stage. We remain ready to incorporate any editorial or minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an architecture-specific implementation and empirical evaluation of a known two-component FP32-to-FP16 splitting approach (explicitly related to prior Ozaki-style and Ootomo-style schemes) on Ascend NPUs. It analyzes hardware effects such as round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend model, then reports measured accuracy and throughput (65.3 TFLOP/s reaching 77% of an externally defined three-GEMM peak). No equations, fitted parameters, or self-citations reduce the reported results to quantities defined inside the paper; the central claims rest on direct measurement against external benchmarks and acknowledged range limitations rather than internal derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard floating-point conversion and rounding semantics plus a chosen residual scaling factor; no new physical entities are postulated.

free parameters (1)
  • residual scaling factor
    A scale is applied to the low-order residual so that it fits into FP16; the abstract does not specify whether this scale is fixed or tuned per matrix.
axioms (1)
  • [standard math] Standard round-to-nearest-even behavior and dynamic-range limits of IEEE FP16 and FP32
    Invoked when discussing conversion effects, underflow, and the representable range of inputs.
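The round-to-nearest-even behavior listed as an axiom can be observed directly with Python's `struct` half-precision format, which follows IEEE binary16 rounding. This is only an illustration of the standard semantics; whether the Ascend conversion instructions behave identically in every corner case is something the paper analyzes and this sketch does not.

```python
import struct

def fp16(x):
    """Round to the nearest IEEE FP16 value, ties to even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

ULP = 2.0 ** -10  # FP16 spacing in [1, 2)

# Exact ties go to the neighbor whose last significand bit is even.
print(fp16(1 + 0.5 * ULP))  # tie between 1 and 1 + ULP      -> 1.0
print(fp16(1 + 1.5 * ULP))  # tie between 1+ULP and 1+2*ULP  -> 1.001953125
```

Ties rounding in alternating directions is what keeps conversion errors roughly unbiased across a long accumulation.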

pith-pipeline@v0.9.0 · 5901 in / 1414 out tokens · 40286 ms · 2026-05-19T02:54:06.807395+00:00 · methodology


