Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers
Pith reviewed 2026-05-21 05:34 UTC · model grok-4.3
The pith
Triangular inversion for Delta-Rule linear transformers reaches up to 4.3 times faster execution on NPUs with full model accuracy preserved.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A systematic comparison of matrix-product-rich triangular inversion routines demonstrates that selected direct and iterative variants deliver up to 4.3 times faster execution than current SGLang implementations on NPUs, while remaining numerically stable in low-precision formats and leaving end-to-end accuracy unchanged across practical Delta-Rule deployments.
What carries the argument
Matrix-product-rich triangular inversion algorithms that trade off direct versus iterative steps to maximize hardware utilization while controlling numerical error accumulation.
If this is right
- Layer-level latency drops substantially for any Delta-Rule linear attention implementation.
- Full end-to-end model accuracy is retained in low-precision inference settings.
- Hardware accelerators such as NPUs can be utilized more efficiently for the inversion sub-routine.
- Designers gain concrete guidance on selecting inversion methods according to precision and hardware constraints.
Where Pith is reading between the lines
- Similar inversion patterns may appear in other linear recurrent architectures and could benefit from the same optimizations.
- Further scaling to larger context lengths or batch sizes would test whether the observed stability margins persist.
- Porting the fastest variants into additional inference engines could compound the reported layer-level gains.
Load-bearing premise
The numerical stability of the proposed inversion methods holds across the low-precision floating-point formats and model scales used in practical Delta-Rule deployments.
What would settle it
An end-to-end inference run on a production-scale Delta-Rule model that shows measurable accuracy degradation after swapping in the new inversion routines.
Figures
read the original abstract
Linear attention has emerged as a cornerstone for efficient long-context architectures, as evidenced by its integration into state-of-the-art open-source models including Qwen3.5/3.6, Kimi Linear, and RWKV-7. Models that incorporate linear attention layers with the so-called Delta-Rule involve the inversion of triangular matrices as a core sub-routine. This operation often forms a performance bottleneck, and, due to its high-sensitivity to numerical errors, it can significantly deteriorate end-to-end model accuracy if it is not carefully implemented. This work provides a systematic analysis of both direct and iterative triangular inversion algorithms, targeting methods that are rich in matrix products, and, therefore, have the potential to efficiently utilize modern hardware. To that end, our analysis covers a broad spectrum of mathematical and practical aspects, with a heavy focus on numerical stability, computational complexity, and, ultimately, hardware efficiency and practical considerations. We provide a rigorous experimental evaluation to verify these properties in practical scenarios, and in low-precision floating-point representations, highlighting the strengths and limitations of each method. Performance benchmarks on NPUs reveal up to $4.3\times$ speed-up against the state-of-the-art implementations of SGLang for triangular matrix inversion, leading to significant performance improvements on the entire layer level, while maintaining full end-to-end model accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes direct and iterative algorithms for triangular matrix inversion, a core subroutine in Delta-Rule linear attention for efficient long-context models (e.g., Qwen, RWKV). It emphasizes numerical stability in low-precision formats (FP16/BF16/FP8), computational complexity, and hardware efficiency, reporting up to 4.3× NPU speedup versus SGLang implementations while preserving full end-to-end model accuracy through rigorous experiments.
Significance. If the stability and performance claims hold under practical conditions, the work offers a practical optimization for a performance bottleneck in linear attention layers, potentially improving inference speed on NPUs for state-of-the-art models without accuracy loss. The focus on matrix-product-rich methods aligns well with modern hardware utilization.
major comments (2)
- [Experimental evaluation] Experimental evaluation section: the stability assertions for low-precision triangular inversion lack quantitative forward-error bounds, residual norm comparisons to higher-precision references, or systematic sweeps over matrix condition numbers. The abstract notes maintained accuracy in low precision, but without these metrics the claim does not address potential error growth in ill-conditioned cases typical of trained Delta-Rule layers.
- [Performance benchmarks] Performance benchmarks: the 4.3× speedup and layer-level improvements are reported against SGLang, yet the manuscript provides no error-bar details, exact data-exclusion rules, or hardware-specific implementation parameters that would allow independent verification of the NPU results.
minor comments (2)
- [Abstract] Abstract: the phrase 'rigorous experimental evaluation' could be clarified by explicitly listing the tested matrix sizes, condition-number ranges, and precision formats in the abstract itself.
- [Methods] Notation: ensure consistent use of symbols for triangular matrices and inversion residuals across sections describing direct versus iterative methods.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Experimental evaluation] Experimental evaluation section: the stability assertions for low-precision triangular inversion lack quantitative forward-error bounds, residual norm comparisons to higher-precision references, or systematic sweeps over matrix condition numbers. The abstract notes maintained accuracy in low precision, but without these metrics the claim does not address potential error growth in ill-conditioned cases typical of trained Delta-Rule layers.
Authors: We appreciate the referee's suggestion for more granular stability metrics. While our primary validation relies on end-to-end model accuracy preservation across full inference runs in low precision, we acknowledge that isolated forward-error analysis would provide additional insight. In the revised manuscript we will add residual norm comparisons against double-precision references and forward-error bounds for representative matrices drawn from trained Delta-Rule layers. We will also include a sweep over matrices with varying condition numbers to quantify error growth behavior. revision: yes
-
Referee: [Performance benchmarks] Performance benchmarks: the 4.3× speedup and layer-level improvements are reported against SGLang, yet the manuscript provides no error-bar details, exact data-exclusion rules, or hardware-specific implementation parameters that would allow independent verification of the NPU results.
Authors: We agree that reproducibility details are important. The revised version will report error bars computed over multiple independent runs, specify the exact warm-up and data-exclusion protocol used for timing measurements, and provide additional hardware configuration parameters (clock rates, memory hierarchy details, and kernel launch configurations) to the extent permitted by the NPU vendor. The reported 4.3× figure is the peak observed speedup; average and median speedups across the benchmark suite will also be included. revision: partial
Circularity Check
No circularity; algorithmic analysis and hardware benchmarks are self-contained
full rationale
The paper presents a systematic analysis of direct and iterative triangular inversion algorithms for Delta-Rule linear attention, supported by mathematical properties, complexity analysis, numerical stability considerations, and empirical benchmarks on NPUs. The reported speed-ups (up to 4.3×) and maintained end-to-end accuracy are grounded in direct hardware measurements and experimental evaluation in low-precision formats, rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation step reduces to its own inputs by construction, and the work relies on external benchmarks and implementation details independent of the target claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Triangular matrix inversion forms a core performance and numerical bottleneck inside Delta-Rule linear attention layers.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Definition 2.1 ... f-stable if ||A^{-1} - eA^{-1}||_2 ≤ c1 n^{c2} κ_2(A) u ||A^{-1}||_2
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Corollary 2.1. The condition number of an n×n DeltaNet matrix A is at most n²
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
James R. Bunch and John E. Hopcroft. Triangular factorization and inversion by fast matrix multiplication.Mathematics of Computation, 28(125):231–236, 1974
work page 1974
-
[2]
A computational model for tensor core units
Rezaul Chowdhury, Francesco Silvestri, and Flavio Vella. A computational model for tensor core units. InProceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 519–521, 2020
work page 2020
-
[3]
Algorithm design for tensor units
Rezaul Chowdhury, Francesco Silvestri, and Flavio Vella. Algorithm design for tensor units. In International Conference on Parallel and Distributed Computing (Euro-Par), pages 353–367. Springer, 2021
work page 2021
-
[4]
The role of arithmetic in fast parallel matrix inversion.Algorithmica, 30(4):685–707, 2001
Bruno Codenotti, Mauro Leoncini, and Franco P Preparata. The role of arithmetic in fast parallel matrix inversion.Algorithmica, 30(4):685–707, 2001
work page 2001
-
[5]
Tri Dao and Albert Gu. Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning (ICML), ICML’24. JMLR, 2024
work page 2024
-
[6]
Fast linear algebra is stable.Numerische Mathematik, 108(1):59–91, October 2007
James Demmel, Ioana Dumitriu, and Olga Holtz. Fast linear algebra is stable.Numerische Mathematik, 108(1):59–91, October 2007
work page 2007
-
[7]
Jeremy J. Du Croz and Nicholas J. Higham. Stability of methods for matrix inversion.IMA Journal of Numerical Analysis, 12(1):1–19, 1992
work page 1992
-
[8]
Sameh.Parallelism in Matrix Compu- tations
Efstratios Gallopoulos, Bernard Philippe, and Ahmed H. Sameh.Parallelism in Matrix Compu- tations. Springer Netherlands, 2016
work page 2016
-
[9]
Gene H. Golub and Charles F. Van Loan.Matrix Computations - 4th Edition. Johns Hopkins University Press, Philadelphia, PA, 2013. 10
work page 2013
-
[10]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding.International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[11]
Nicholas J. Higham. Iterative refinement for linear systems and LAPACK.IMA Journal of Numerical Analysis, 17(4):495–509, 1997
work page 1997
-
[12]
Higham.Accuracy and Stability of Numerical Algorithms
Nicholas J. Higham.Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, January 2002
work page 2002
-
[13]
Improving bilinear RNN with closed-loop control
Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao Sun. Improving bilinear RNN with closed-loop control. InAdvances in Neural Information Processing Systems (NeurIPS), 2026
work page 2026
-
[14]
Transformer quality in linear time
Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International Conference on Machine Learning (ICML), pages 9099–9117. PMLR, 2022
work page 2022
-
[15]
The elements of DeltaNet’s core inverse matrix are always within [-1, 1]
Su Jianlin. The elements of DeltaNet’s core inverse matrix are always within [-1, 1]. Available online:https://spaces.ac.cn/archives/11563, Jan 2026
work page 2026
-
[16]
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. InInternational Conference on Machine Learning (ICML), pages 5156–5165. PMLR, 2020
work page 2020
-
[17]
Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y . Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiez...
work page 2025
-
[18]
Iterative refinement in floating point.Journal of the ACM, 14(2):316–321, 1967
Cleve B Moler. Iterative refinement in floating point.Journal of the ACM, 14(2):316–321, 1967
work page 1967
-
[19]
Efficient parallel solution of linear systems
Victor Pan and John Reif. Efficient parallel solution of linear systems. InProceedings of the Symposium on Theory of Computing (STOC), pages 143–152, 1985
work page 1985
-
[20]
Victor Pan and Robert Schreiber. An improved newton iteration for the generalized inverse of a matrix, with applications.SIAM Journal on Scientific and Statistical Computing, 12(5):1109– 1130, 1991
work page 1991
-
[21]
Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng
Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 ”Goose” with Expressive Dynamic State Evolution. InSecond Conference on Language Modeling, 2025
work page 2025
-
[22]
Ahmed H. Sameh and Richard P. Brent. Solving triangular systems on a parallel computer. SIAM Journal on Numerical Analysis, 14(6):1101–1113, December 1977
work page 1977
-
[23]
Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. In Marina Meila and Tong Zhang, editors,International Conference on Machine Learning (ICML), volume 139 ofProceedings of Machine Learning Research, pages 9355–9366. PMLR, 18–24 Jul 2021
work page 2021
-
[24]
Günther Schulz. Iterative berechung der reziproken matrix.ZAMM - Journal of Applied Mathe- matics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 13(1):57–59, January 1933
work page 1933
-
[25]
SGLang-Kernel-NPU: SGLang Kernel Library for Ascend NPU
SGLang Project Contributors. SGLang-Kernel-NPU: SGLang Kernel Library for Ascend NPU. https://github.com/sgl-project/sgl-kernel-npu, 2025. Accessed: 2026-05-06. 11
work page 2025
-
[26]
Hermitian diagonalization in linear precision
Rikhav Shah. Hermitian diagonalization in linear precision. InProceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 5599–5615. SIAM, 2025
work page 2025
-
[27]
Segmented Operations using Matrix Multiplications.arXiv preprint arXiv:2506.23906, 2025
Aleksandros Sobczyk, Giuseppe Sorrentino, and Anastasios Zouzias. Segmented Operations using Matrix Multiplications.arXiv preprint arXiv:2506.23906, 2025
-
[28]
Torsten Söderström and Gilbert W. Stewart. On the Numerical Properties of an Iterative Method for Computing the Moore–Penrose Generalized Inverse.SIAM Journal on Numerical Analysis, 11(1):61–74, March 1974
work page 1974
-
[29]
TileLang-Ascend: Ascend TileLang Adapter
TileLang Contributors. TileLang-Ascend: Ascend TileLang Adapter. https://github.com/ tile-ai/tilelang-ascend, 2025. Accessed: 2026-05-06
work page 2025
-
[30]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems (NeurIPS), 30, 2017
work page 2017
-
[31]
Divakar Viswanath and Lloyd Nicholas Trefethen. Condition numbers of random triangular matrices.SIAM Jounrnal on Matrix Analysis and Applications, 19(2):564–581, 1998
work page 1998
-
[32]
TileLang: Bridge Programmabil- ity and Performance in Modern Neural Kernels
Lei Wang, Yu Cheng, Yining Shi, Zhiwen Mo, Zhengju Tang, Wenhao Xie, Tong Wu, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, and Zhi Yang. TileLang: Bridge Programmabil- ity and Performance in Modern Neural Kernels. InInternational Conference on Learning Representations (ICLR), 2026
work page 2026
-
[33]
Wilkinson.Progress report on the automatic computing engine
James H. Wilkinson.Progress report on the automatic computing engine. Mathematics Division, Department of Scientific & Industrial Research, 1948
work page 1948
- [34]
-
[35]
DeltaFormer: Unlock the state space of Transformer
Mingyu Xu, Tenglong Ao, Jiaao He, Jianqiao Lu, Guang Shi, and Shu Zhong. DeltaFormer: Unlock the state space of Transformer. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[36]
Gated Delta Networks: Improving Mamba2 with Delta Rule
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations (ICLR), volume 2025, pages 29687–29707, 2025
work page 2025
-
[37]
Gated linear attention transformers with hardware-efficient training
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. InInternational Conference on Machine Learning (ICML), ICML’24. JMLR, 2024
work page 2024
-
[38]
Parallelizing linear transformers with the delta rule over sequence length
Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 115491–115522. Curran Associates, Inc., 2024
work page 2024
- [39]
-
[40]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[41]
Understanding transformer from the perspective of associative memory, 2025
Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory, 2025
work page 2025
-
[42]
Anastasios Zouzias and William F. McColl. A parallel scan algorithm in the tensor core unit model. InInternational Conference on Parallel and Distributed Computing (Euro-Par), pages 489–502, 2023. 12 A Code availability The source code for the kernels and the scripts to reproduce the experimental evaluations are available in the following links: • https:/...
work page 2023
-
[43]
CreateD e(X, b)andD o(X, b)as defined above
-
[44]
UpdateX←D e(X, b) +D o(X, b)−D o(X, b)·L·D e(X, b). 3.b←2b. Finally, we return X. In total there are two matrix products of sizen×n for each b= 1,2,4, . . . , n/2 , which gives a total of2 log(n)matrix products. Correctness can be verified by induction. Note that, for b= 1 , after step 2., the 2×2 diagonal blocks of X contain the inverses of the 2×2 diago...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.