pith. sign in

arxiv: 2605.21325 · v1 · pith:XZEYEPZCnew · submitted 2026-05-20 · 💻 cs.LG

Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers

Pith reviewed 2026-05-21 05:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords triangular matrix inversiondelta-rulelinear attentionnumerical stabilitylow-precision computationhardware efficiencylinear transformersperformance optimization
0
0 comments X

The pith

Triangular inversion for Delta-Rule linear transformers reaches up to 4.3 times faster execution on NPUs with full model accuracy preserved.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines direct and iterative algorithms for inverting triangular matrices that arise in Delta-Rule linear attention. These operations sit at the core of efficient long-context models but create both speed bottlenecks and risks of numerical error that can harm final accuracy. The authors concentrate on methods built around repeated matrix multiplications so they can exploit modern hardware accelerators effectively. Their evaluation tracks stability in low-precision arithmetic, overall runtime, and end-to-end model behavior, showing concrete speed gains without accuracy trade-offs.

Core claim

A systematic comparison of matrix-product-rich triangular inversion routines demonstrates that selected direct and iterative variants deliver up to 4.3 times faster execution than current SGLang implementations on NPUs, while remaining numerically stable in low-precision formats and leaving end-to-end accuracy unchanged across practical Delta-Rule deployments.

What carries the argument

Matrix-product-rich triangular inversion algorithms that trade off direct versus iterative steps to maximize hardware utilization while controlling numerical error accumulation.

If this is right

  • Layer-level latency drops substantially for any Delta-Rule linear attention implementation.
  • Full end-to-end model accuracy is retained in low-precision inference settings.
  • Hardware accelerators such as NPUs can be utilized more efficiently for the inversion sub-routine.
  • Designers gain concrete guidance on selecting inversion methods according to precision and hardware constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar inversion patterns may appear in other linear recurrent architectures and could benefit from the same optimizations.
  • Further scaling to larger context lengths or batch sizes would test whether the observed stability margins persist.
  • Porting the fastest variants into additional inference engines could compound the reported layer-level gains.

Load-bearing premise

The numerical stability of the proposed inversion methods holds across the low-precision floating-point formats and model scales used in practical Delta-Rule deployments.

What would settle it

An end-to-end inference run on a production-scale Delta-Rule model that shows measurable accuracy degradation after swapping in the new inversion routines.

Figures

Figures reproduced from arXiv: 2605.21325 by Aleksandros Sobczyk, Anastasios Zouzias, Christos K. Matzoros, Filip Skogh, Gioele Gottardo, Jiawei Zhuang, Mirko De Vita.

Figure 1
Figure 1. Figure 1: Left: Runtime breakdown of single GDN layer. Right: Accuracy drop in end-to-end [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of the inverse accuracy of various methods, for three different input precision [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: GDN layer run on device using Qwen3.5-0.8B tensors as input (B=8, N=16, S=1024, C = 64 or C = 128) We next investigate numerical accuracy in more detail by analyzing how errors in computing the triangular inverse affect end-to-end model performance. We employ the Newton–Schulz method and evaluate it at different iteration counts. For each checkpoint, we measure the relative error with 8 [PITH_FULL_IMAGE:f… view at source ↗
Figure 5
Figure 5. Figure 5: Kernel evaluation of different inverse algorithms. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of the inverse accuracy of various methods, for three different input precision [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of the accuracy obtained on MMLU (10-subjects) of different models with [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
read the original abstract

Linear attention has emerged as a cornerstone for efficient long-context architectures, as evidenced by its integration into state-of-the-art open-source models including Qwen3.5/3.6, Kimi Linear, and RWKV-7. Models that incorporate linear attention layers with the so-called Delta-Rule involve the inversion of triangular matrices as a core sub-routine. This operation often forms a performance bottleneck, and, due to its high-sensitivity to numerical errors, it can significantly deteriorate end-to-end model accuracy if it is not carefully implemented. This work provides a systematic analysis of both direct and iterative triangular inversion algorithms, targeting methods that are rich in matrix products, and, therefore, have the potential to efficiently utilize modern hardware. To that end, our analysis covers a broad spectrum of mathematical and practical aspects, with a heavy focus on numerical stability, computational complexity, and, ultimately, hardware efficiency and practical considerations. We provide a rigorous experimental evaluation to verify these properties in practical scenarios, and in low-precision floating-point representations, highlighting the strengths and limitations of each method. Performance benchmarks on NPUs reveal up to $4.3\times$ speed-up against the state-of-the-art implementations of SGLang for triangular matrix inversion, leading to significant performance improvements on the entire layer level, while maintaining full end-to-end model accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes direct and iterative algorithms for triangular matrix inversion, a core subroutine in Delta-Rule linear attention for efficient long-context models (e.g., Qwen, RWKV). It emphasizes numerical stability in low-precision formats (FP16/BF16/FP8), computational complexity, and hardware efficiency, reporting up to 4.3× NPU speedup versus SGLang implementations while preserving full end-to-end model accuracy through rigorous experiments.

Significance. If the stability and performance claims hold under practical conditions, the work offers a practical optimization for a performance bottleneck in linear attention layers, potentially improving inference speed on NPUs for state-of-the-art models without accuracy loss. The focus on matrix-product-rich methods aligns well with modern hardware utilization.

major comments (2)
  1. [Experimental evaluation] Experimental evaluation section: the stability assertions for low-precision triangular inversion lack quantitative forward-error bounds, residual norm comparisons to higher-precision references, or systematic sweeps over matrix condition numbers. The abstract notes maintained accuracy in low precision, but without these metrics the claim does not address potential error growth in ill-conditioned cases typical of trained Delta-Rule layers.
  2. [Performance benchmarks] Performance benchmarks: the 4.3× speedup and layer-level improvements are reported against SGLang, yet the manuscript provides no error-bar details, exact data-exclusion rules, or hardware-specific implementation parameters that would allow independent verification of the NPU results.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'rigorous experimental evaluation' could be clarified by explicitly listing the tested matrix sizes, condition-number ranges, and precision formats in the abstract itself.
  2. [Methods] Notation: ensure consistent use of symbols for triangular matrices and inversion residuals across sections describing direct versus iterative methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experimental evaluation] Experimental evaluation section: the stability assertions for low-precision triangular inversion lack quantitative forward-error bounds, residual norm comparisons to higher-precision references, or systematic sweeps over matrix condition numbers. The abstract notes maintained accuracy in low precision, but without these metrics the claim does not address potential error growth in ill-conditioned cases typical of trained Delta-Rule layers.

    Authors: We appreciate the referee's suggestion for more granular stability metrics. While our primary validation relies on end-to-end model accuracy preservation across full inference runs in low precision, we acknowledge that isolated forward-error analysis would provide additional insight. In the revised manuscript we will add residual norm comparisons against double-precision references and forward-error bounds for representative matrices drawn from trained Delta-Rule layers. We will also include a sweep over matrices with varying condition numbers to quantify error growth behavior. revision: yes

  2. Referee: [Performance benchmarks] Performance benchmarks: the 4.3× speedup and layer-level improvements are reported against SGLang, yet the manuscript provides no error-bar details, exact data-exclusion rules, or hardware-specific implementation parameters that would allow independent verification of the NPU results.

    Authors: We agree that reproducibility details are important. The revised version will report error bars computed over multiple independent runs, specify the exact warm-up and data-exclusion protocol used for timing measurements, and provide additional hardware configuration parameters (clock rates, memory hierarchy details, and kernel launch configurations) to the extent permitted by the NPU vendor. The reported 4.3× figure is the peak observed speedup; average and median speedups across the benchmark suite will also be included. revision: partial

Circularity Check

0 steps flagged

No circularity; algorithmic analysis and hardware benchmarks are self-contained

full rationale

The paper presents a systematic analysis of direct and iterative triangular inversion algorithms for Delta-Rule linear attention, supported by mathematical properties, complexity analysis, numerical stability considerations, and empirical benchmarks on NPUs. The reported speed-ups (up to 4.3×) and maintained end-to-end accuracy are grounded in direct hardware measurements and experimental evaluation in low-precision formats, rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation step reduces to its own inputs by construction, and the work relies on external benchmarks and implementation details independent of the target claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard assumptions from numerical linear algebra and the domain premise that triangular inversion is a performance-critical subroutine inside Delta-Rule layers.

axioms (1)
  • domain assumption Triangular matrix inversion forms a core performance and numerical bottleneck inside Delta-Rule linear attention layers.
    Explicitly stated in the abstract as the motivation for the work.

pith-pipeline@v0.9.0 · 5796 in / 1237 out tokens · 70007 ms · 2026-05-21T05:34:18.552858+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

  1. [1]

    Bunch and John E

    James R. Bunch and John E. Hopcroft. Triangular factorization and inversion by fast matrix multiplication.Mathematics of Computation, 28(125):231–236, 1974

  2. [2]

    A computational model for tensor core units

    Rezaul Chowdhury, Francesco Silvestri, and Flavio Vella. A computational model for tensor core units. InProceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 519–521, 2020

  3. [3]

    Algorithm design for tensor units

    Rezaul Chowdhury, Francesco Silvestri, and Flavio Vella. Algorithm design for tensor units. In International Conference on Parallel and Distributed Computing (Euro-Par), pages 353–367. Springer, 2021

  4. [4]

    The role of arithmetic in fast parallel matrix inversion.Algorithmica, 30(4):685–707, 2001

    Bruno Codenotti, Mauro Leoncini, and Franco P Preparata. The role of arithmetic in fast parallel matrix inversion.Algorithmica, 30(4):685–707, 2001

  5. [5]

    Transformers are SSMs: generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning (ICML), ICML’24. JMLR, 2024

  6. [6]

    Fast linear algebra is stable.Numerische Mathematik, 108(1):59–91, October 2007

    James Demmel, Ioana Dumitriu, and Olga Holtz. Fast linear algebra is stable.Numerische Mathematik, 108(1):59–91, October 2007

  7. [7]

    Du Croz and Nicholas J

    Jeremy J. Du Croz and Nicholas J. Higham. Stability of methods for matrix inversion.IMA Journal of Numerical Analysis, 12(1):1–19, 1992

  8. [8]

    Sameh.Parallelism in Matrix Compu- tations

    Efstratios Gallopoulos, Bernard Philippe, and Ahmed H. Sameh.Parallelism in Matrix Compu- tations. Springer Netherlands, 2016

  9. [9]

    Golub and Charles F

    Gene H. Golub and Charles F. Van Loan.Matrix Computations - 4th Edition. Johns Hopkins University Press, Philadelphia, PA, 2013. 10

  10. [10]

    Measuring Massive Multitask Language Understanding.International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding.International Conference on Learning Representations (ICLR), 2021

  11. [11]

    Nicholas J. Higham. Iterative refinement for linear systems and LAPACK.IMA Journal of Numerical Analysis, 17(4):495–509, 1997

  12. [12]

    Higham.Accuracy and Stability of Numerical Algorithms

    Nicholas J. Higham.Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, January 2002

  13. [13]

    Improving bilinear RNN with closed-loop control

    Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao Sun. Improving bilinear RNN with closed-loop control. InAdvances in Neural Information Processing Systems (NeurIPS), 2026

  14. [14]

    Transformer quality in linear time

    Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. Transformer quality in linear time. In International Conference on Machine Learning (ICML), pages 9099–9117. PMLR, 2022

  15. [15]

    The elements of DeltaNet’s core inverse matrix are always within [-1, 1]

    Su Jianlin. The elements of DeltaNet’s core inverse matrix are always within [-1, 1]. Available online:https://spaces.ac.cn/archives/11563, Jan 2026

  16. [16]

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. InInternational Conference on Machine Learning (ICML), pages 5156–5165. PMLR, 2020

  17. [17]

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y . Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiez...

  18. [18]

    Iterative refinement in floating point.Journal of the ACM, 14(2):316–321, 1967

    Cleve B Moler. Iterative refinement in floating point.Journal of the ACM, 14(2):316–321, 1967

  19. [19]

    Efficient parallel solution of linear systems

    Victor Pan and John Reif. Efficient parallel solution of linear systems. InProceedings of the Symposium on Theory of Computing (STOC), pages 143–152, 1985

  20. [20]

    An improved newton iteration for the generalized inverse of a matrix, with applications.SIAM Journal on Scientific and Statistical Computing, 12(5):1109– 1130, 1991

    Victor Pan and Robert Schreiber. An improved newton iteration for the generalized inverse of a matrix, with applications.SIAM Journal on Scientific and Statistical Computing, 12(5):1109– 1130, 1991

  21. [21]

    Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng

    Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 ”Goose” with Expressive Dynamic State Evolution. InSecond Conference on Language Modeling, 2025

  22. [22]

    Sameh and Richard P

    Ahmed H. Sameh and Richard P. Brent. Solving triangular systems on a parallel computer. SIAM Journal on Numerical Analysis, 14(6):1101–1113, December 1977

  23. [23]

    Linear Transformers Are Secretly Fast Weight Programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. In Marina Meila and Tong Zhang, editors,International Conference on Machine Learning (ICML), volume 139 ofProceedings of Machine Learning Research, pages 9355–9366. PMLR, 18–24 Jul 2021

  24. [24]

    Iterative berechung der reziproken matrix.ZAMM - Journal of Applied Mathe- matics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 13(1):57–59, January 1933

    Günther Schulz. Iterative berechung der reziproken matrix.ZAMM - Journal of Applied Mathe- matics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 13(1):57–59, January 1933

  25. [25]

    SGLang-Kernel-NPU: SGLang Kernel Library for Ascend NPU

    SGLang Project Contributors. SGLang-Kernel-NPU: SGLang Kernel Library for Ascend NPU. https://github.com/sgl-project/sgl-kernel-npu, 2025. Accessed: 2026-05-06. 11

  26. [26]

    Hermitian diagonalization in linear precision

    Rikhav Shah. Hermitian diagonalization in linear precision. InProceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 5599–5615. SIAM, 2025

  27. [27]

    Segmented Operations using Matrix Multiplications.arXiv preprint arXiv:2506.23906, 2025

    Aleksandros Sobczyk, Giuseppe Sorrentino, and Anastasios Zouzias. Segmented Operations using Matrix Multiplications.arXiv preprint arXiv:2506.23906, 2025

  28. [28]

    Torsten Söderström and Gilbert W. Stewart. On the Numerical Properties of an Iterative Method for Computing the Moore–Penrose Generalized Inverse.SIAM Journal on Numerical Analysis, 11(1):61–74, March 1974

  29. [29]

    TileLang-Ascend: Ascend TileLang Adapter

    TileLang Contributors. TileLang-Ascend: Ascend TileLang Adapter. https://github.com/ tile-ai/tilelang-ascend, 2025. Accessed: 2026-05-06

  30. [30]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

  31. [31]

    Condition numbers of random triangular matrices.SIAM Jounrnal on Matrix Analysis and Applications, 19(2):564–581, 1998

    Divakar Viswanath and Lloyd Nicholas Trefethen. Condition numbers of random triangular matrices.SIAM Jounrnal on Matrix Analysis and Applications, 19(2):564–581, 1998

  32. [32]

    TileLang: Bridge Programmabil- ity and Performance in Modern Neural Kernels

    Lei Wang, Yu Cheng, Yining Shi, Zhiwen Mo, Zhengju Tang, Wenhao Xie, Tong Wu, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, and Zhi Yang. TileLang: Bridge Programmabil- ity and Performance in Modern Neural Kernels. InInternational Conference on Learning Representations (ICLR), 2026

  33. [33]

    Wilkinson.Progress report on the automatic computing engine

    James H. Wilkinson.Progress report on the automatic computing engine. Mathematics Division, Department of Scientific & Industrial Research, 1948

  34. [34]

    Wilkinson

    James H. Wilkinson. Error analysis of direct methods of matrix inversion.Journal of the ACM, 8(3):281–330, 1961

  35. [35]

    DeltaFormer: Unlock the state space of Transformer

    Mingyu Xu, Tenglong Ao, Jiaao He, Jianqiao Lu, Guang Shi, and Shu Zhong. DeltaFormer: Unlock the state space of Transformer. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  36. [36]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations (ICLR), volume 2025, pages 29687–29707, 2025

  37. [37]

    Gated linear attention transformers with hardware-efficient training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. InInternational Conference on Machine Learning (ICML), ICML’24. JMLR, 2024

  38. [38]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 115491–115522. Curran Associates, Inc., 2024

  39. [39]

    Al-Fhaid

    Malik Zaka Ullah, Fazlollah Soleymani, and Abdulrahman S. Al-Fhaid. An efficient matrix iter- ation for computing weighted Moore–Penrose inverse.Applied Mathematics and Computation, 226:441–454, January 2014

  40. [40]

    Gonzalez, Clark Barrett, and Ying Sheng

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  41. [41]

    Understanding transformer from the perspective of associative memory, 2025

    Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory, 2025

  42. [42]

    Anastasios Zouzias and William F. McColl. A parallel scan algorithm in the tensor core unit model. InInternational Conference on Parallel and Distributed Computing (Euro-Par), pages 489–502, 2023. 12 A Code availability The source code for the kernels and the scripts to reproduce the experimental evaluations are available in the following links: • https:/...

  43. [43]

    CreateD e(X, b)andD o(X, b)as defined above

  44. [44]

    I/O level

    UpdateX←D e(X, b) +D o(X, b)−D o(X, b)·L·D e(X, b). 3.b←2b. Finally, we return X. In total there are two matrix products of sizen×n for each b= 1,2,4, . . . , n/2 , which gives a total of2 log(n)matrix products. Correctness can be verified by induction. Note that, for b= 1 , after step 2., the 2×2 diagonal blocks of X contain the inverses of the 2×2 diago...