When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

Denghao Li; Kui Zhang; Liang Zhang; Lingjuan Ge; Luoming Zhang; Matthew Harper Langston; Tian Liu; Weiliang Will Zeng; Yin Huang; Yuwei Ren

arxiv: 2606.06034 · v1 · pith:JIMYX2Z6new · submitted 2026-06-04 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 02:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: matrix inversion · linear attention · Neumann series · Gated DeltaNet · quantized inference · chunk-wise attention · MatMul optimization · low-bit integer

The pith

Truncated Neumann expansion approximates matrix inversion in chunk-wise linear attention using only multiplications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inverting the strictly lower-triangular matrices arising in chunk-wise Gated DeltaNet can be done with a truncated Neumann series plus structural masking and parallel residual correction, turning a sequential bottleneck into parallel matrix multiplications. This matters because forward-substitution inversion limits hardware utilization on NPUs during long-context decoding. The authors adapt the truncation order and residual step to chunk size, extend the method to low-bit integers by controlling dynamic-range growth, and validate it on Qwen3.5 models. Experiments show the approximation preserves accuracy in both floating-point and quantized settings while delivering large kernel speedups.

Core claim

A MatMul-based algorithm for strictly lower-triangular matrices in chunk-wise linear attention uses a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies; the approximation order and residual step are tuned to chunk size, and dynamic-range mitigation allows extension to low-bit INT while keeping model accuracy.
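As a rough illustration of the truncated expansion named in the core claim, here is a minimal sketch in plain Python. It assumes the matrix to invert has the form I - A with A strictly lower triangular (the paper's sign convention may differ). The helper names `matmul`, `identity`, `neumann_inverse`, and `max_abs_residual`, the 6×6 size, and the matrix entries are made up for the example. Masking, residual correction, and quantization are left out.

```python
def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(A, order):
    """Approximate (I - A)^{-1} by I + A + A^2 + ... + A^order.

    Only matrix products and additions are used, so there is no
    row-by-row dependency as in forward substitution.
    """
    n = len(A)
    total = identity(n)
    power = identity(n)
    for _ in range(order):
        power = matmul(power, A)  # next power of A
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total


def max_abs_residual(A, X):
    """Largest entry of |(I - A) X - I|; zero means X is the exact inverse."""
    n = len(A)
    eye = identity(n)
    M = [[eye[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    P = matmul(M, X)
    return max(abs(P[i][j] - eye[i][j]) for i in range(n) for j in range(n))


# Hypothetical 6x6 strictly lower-triangular matrix whose entries
# shrink away from the diagonal (illustrative values only).
k = 6
A = [[0.3 / (i - j) if i > j else 0.0 for j in range(k)] for i in range(k)]

for order in (2, 4, k - 1):
    print(order, max_abs_residual(A, neumann_inverse(A, order)))
```

Because a strictly lower-triangular k×k matrix satisfies A^k = 0, order k - 1 reproduces the inverse up to rounding; smaller orders trade accuracy for fewer MatMuls.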

What carries the argument

Truncated Neumann expansion with structural masking and parallel residual correction for strictly lower-triangular matrices, adapted per chunk size.
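The page does not spell out how the paper's residual correction is computed, so the following is only a generic stand-in: one Newton-Schulz style refinement step, X + X(I - MX), applied to a low-order Neumann sum for M = I - A. The sign convention, the helper names, and the 6×6 test matrix are assumptions made for the sketch, and the paper's actual correction may differ.

```python
def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def combine(X, Y, sign):
    """Entry-wise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def neumann_sum(A, order):
    """I + A + ... + A^order, a truncated series for (I - A)^{-1}."""
    total = identity(len(A))
    power = identity(len(A))
    for _ in range(order):
        power = matmul(power, A)
        total = combine(total, power, +1.0)
    return total


def residual(A, X):
    """R = I - (I - A) X; all zeros when X is the exact inverse."""
    M = combine(identity(len(A)), A, -1.0)
    return combine(identity(len(A)), matmul(M, X), -1.0)


def residual_correct(A, X):
    """One generic refinement step X + X R, built only from MatMuls.

    NOTE: an illustrative stand-in, not the paper's documented step.
    """
    return combine(X, matmul(X, residual(A, X)), +1.0)


def max_abs(X):
    return max(abs(v) for row in X for v in row)


# Hypothetical 6x6 strictly lower-triangular test matrix.
k = 6
A = [[0.3 / (i - j) if i > j else 0.0 for j in range(k)] for i in range(k)]

X0 = neumann_sum(A, 2)        # order-2 truncation
X1 = residual_correct(A, X0)  # one correction step
print(max_abs(residual(A, X0)), max_abs(residual(A, X1)))
```

For a Neumann sum of order N the residual equals A^(N+1), so one such step extends the series to order 2N + 1 at the cost of two extra MatMuls. In this example that already reaches order 5 = k - 1, which is exact for a 6×6 strictly lower-triangular matrix.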

If this is right

Kernel-level speedups reach up to 5x with a 20% reduction in decode-layer overhead on Qwen3.5-family models.
The method works under both floating-point and low-precision inference without accuracy loss.
Adaptation of truncation order to chunk size keeps compute cost low while maintaining fidelity.
The approach removes the sequential dependency of forward substitution, improving NPU utilization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural properties (diagonal concentration in inverses of lower-triangular attention matrices) may appear in other linear or state-space models, suggesting the technique could transfer beyond Gated DeltaNet.
Hardware designers could exploit the multiplication-only nature to simplify NPU matrix units for attention workloads.
Testing the approximation error growth with increasing chunk size would clarify the practical limit on context length.

Load-bearing premise

The inverse matrices exhibit enough diagonal concentration and the Neumann terms grow fast enough that a short truncated expansion plus masking and correction is accurate enough.

What would settle it

Measure the element-wise or operator-norm error of the approximated inverse against exact inversion on matrices extracted from real Qwen3.5 attention layers, or compare end-to-end model accuracy when swapping the approximation in and out.
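A measurement of this kind could be organized along the following lines. The sketch is plain Python on a synthetic matrix, not on matrices extracted from Qwen3.5 layers. It assumes the target is (I - A)^{-1} with A strictly lower triangular, uses forward substitution as the exact reference, and reports the relative Frobenius error per truncation order. All function names and the 8×8 test matrix are hypothetical.

```python
import math


def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def exact_inverse_unit_lower(M):
    """Invert a unit lower-triangular matrix by forward substitution.

    Row i needs rows 0..i-1 first, which is the sequential dependency
    that the MatMul-based approximation is meant to remove.
    """
    n = len(M)
    X = [[0.0] * n for _ in range(n)]
    for i in range(n):
        X[i][i] = 1.0
        for j in range(i):
            X[i][j] = -sum(M[i][t] * X[t][j] for t in range(j, i))
    return X


def neumann_sum(A, order):
    """I + A + ... + A^order, a truncated series for (I - A)^{-1}."""
    n = len(A)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    for _ in range(order):
        power = matmul(power, A)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total


def frobenius(X):
    return math.sqrt(sum(v * v for row in X for v in row))


def relative_error(approx, exact):
    diff = [[a - e for a, e in zip(row_a, row_e)]
            for row_a, row_e in zip(approx, exact)]
    return frobenius(diff) / frobenius(exact)


# Synthetic 8x8 strictly lower-triangular matrix (illustrative values).
k = 8
A = [[0.3 / (i - j) if i > j else 0.0 for j in range(k)] for i in range(k)]
M = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(k)] for i in range(k)]

exact = exact_inverse_unit_lower(M)
errors = [relative_error(neumann_sum(A, m), exact) for m in range(k)]
for m, err in enumerate(errors):
    print(m, err)
```

Running the same loop over matrices captured from real layers, and over several chunk sizes, would give the error curves this section asks for.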

Figures

Figures reproduced from arXiv: 2606.06034 by Denghao Li, Kui Zhang, Liang Zhang, Lingjuan Ge, Luoming Zhang, Matthew Harper Langston, Tian Liu, Weiliang Will Zeng, Yin Huang, Yuwei Ren.

**Figure 1.** Cycle breakdown across chunk sizes at fixed sequence length (L = 128) on a GatedDeltaNet layer from Qwen3.5-4B. The lighter segment denotes base computation, while the darker segment highlights matrix-inverse overhead. … recurrent state, avoiding the quadratic O(T^2) cost. GatedDeltaNet (Yang et al., 2025a), adopted by recent large-scale models such as Qwen3.5 (Qwen Team, 2026) and Kimi (Zhang et al., 2025…

**Figure 2.** Distribution of A^n over 100 samples. Values exceeding the FP16 limit (65,504) are highlighted in red; 2 samples exhibit overflow, indicating heavy-tailed growth in higher-order terms. Lemma 3.1 (Diagonal Localization of Neumann Series for strictly lower-triangular matrix). Let A ∈ R^(k×k) be strictly lower triangular, i.e., A_ij = 0 for i ≤ j. Then for any n ≥ 0, (A^n)_ij = 0 whenever i − j < n. Based on Lemma 3.1,…

**Figure 3.** Activation distribution under Neumann truncation at orders 3 and 4 for a 64×64 matrix Neumann series. 4. Experiments. 4.1. Experimental Setting. We select the Qwen3Next (Team, 2025) and Qwen3.5 (Qwen Team, 2026) model families to complete a full accuracy and on-target latency study. Unless otherwise stated, all experiments are conducted with a chunk size k = 64, a Neumann series order N = 3, and S = 8 residual…

**Figure 4.** Single-kernel performance across different chunk sizes. Here, H = 32, Dk = 128. Combining the single-kernel results, we observe that as non-MatMul operations become faster, the relative benefit of our matrix inversion method becomes more pronounced. 4.4. Ablation Study. Effect of each module

**Figure 5.** Low-order truncation is sufficient for a strictly lower-triangular 64 × 64 example. Left: truncation errors ∥E_m∥_F and ∥E_m∥_2 versus truncation order m, together with the geometric-series upper bound ∥A∥_2^(m+1) / (1 − ∥A∥_2). Right: a captured-structure proxy 1 − ∥E_m∥_F / ∥T∥_F. Errors drop sharply within the first few orders, indicating that most inverse structure is captured by m ≪ k, making full expansion unne…

**Figure 6.** Accumulated diagonal power ratio across layers. The ratio saturates quickly with increasing n; while >98% is typically captured at small n, achieving 0.99 requires substantially larger orders, revealing layer-wise variation and the cost–accuracy trade-off. Real value experiments. Based on Lemma 3.1, the truncation error can be efficiently estimated using the accumulated power ratio along the diagonal dimen…
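The band structure stated in Lemma 3.1 (quoted in the Figure 2 excerpt) can be checked with a short derivation. The block below is a sketch rather than the paper's own proof, and it assumes the inverse being approximated is (I - A)^{-1}; for (I + A)^{-1}, replace A by -A throughout. The symbol E_m for the truncation remainder follows the Figure 5 caption.

```latex
% Entry (i, j) of A^n sums over index chains from i down to j.
% A_{ab} is nonzero only when a > b, so every chain must be strictly decreasing:
\[
  (A^n)_{ij} \;=\; \sum_{i > l_1 > \cdots > l_{n-1} > j}
      A_{i l_1}\, A_{l_1 l_2} \cdots A_{l_{n-1} j} .
\]
% Such a chain needs n - 1 distinct integers strictly between j and i,
% which exist only if i - j >= n. Hence
\[
  (A^n)_{ij} = 0 \quad \text{whenever } i - j < n ,
  \qquad\text{and in particular}\qquad A^k = 0 .
\]
% The Neumann series is therefore finite, and truncating at order m leaves
\[
  (I - A)^{-1} \;=\; \sum_{n=0}^{k-1} A^n ,
  \qquad
  E_m \;=\; \sum_{n=m+1}^{k-1} A^n ,
  \qquad
  (E_m)_{ij} = 0 \ \text{ for } i - j \le m .
\]
```

The remainder touches only entries more than m steps below the diagonal. This is what makes a mask over the near-diagonal band a natural companion to truncation.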

Original abstract

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
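The dynamic-range concern in the abstract can be made concrete with a toy example. The sketch below raises a synthetic strictly lower-triangular matrix to successive powers and reports the first power whose largest entry exceeds the FP16 limit of 65,504 cited in Figure 2. The 16×16 size, the constant entry value 8.0, and all names are assumptions chosen to make the growth visible. Matrices from a real model would have different magnitudes.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs(X):
    return max(abs(v) for row in X for v in row)


# Synthetic matrix: every entry below the diagonal equals 8.0.
k = 16
A = [[8.0 if i > j else 0.0 for j in range(k)] for i in range(k)]

peaks = []             # largest |entry| of A^1, A^2, ...
first_overflow = None  # first power that no longer fits in FP16
power = A
for n in range(1, 6):
    peak = max_abs(power)
    peaks.append(peak)
    if first_overflow is None and peak > FP16_MAX:
        first_overflow = n
    power = matmul(power, A)

print(peaks)
print("first power exceeding FP16 range:", first_overflow)
```

In this toy case the third power still fits and the fourth does not. That is one way to see why a low truncation order and explicit range control matter once the powers are stored in a narrow format.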

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a MatMul-based Neumann truncation for inverting lower-triangular matrices in chunked Gated DeltaNet with low-bit handling, but the key matrix properties that justify the approximation are not shown.

The main takeaway is a concrete, hardware-oriented tweak: replace forward substitution with a truncated Neumann series plus structural masking and parallel residual correction for the strictly lower-triangular matrices that appear in chunk-wise linear attention. They add chunk-size adaptation for the order and residual step, plus a fix for dynamic-range growth under repeated powering so the method stays usable in INT low precision.

What works is the problem framing. Forward substitution really does limit parallelism on NPUs, and moving to dense MatMuls is a direct way to improve utilization. Extending the idea to quantized Gated DeltaNet and reporting up to 5× kernel speedup with 20% lower decode overhead on Qwen3.5 models, while claiming accuracy holds in both FP and low-bit settings, is the kind of targeted engineering that can matter for deployment.

The soft spot is the missing verification of the two assumptions the method rests on. The abstract says the inverse shows diagonal concentration and that Neumann terms grow rapidly enough for safe truncation, but supplies no measured decay rates for ||A^k|| or off-diagonal mass fractions on the actual matrices produced by Gated DeltaNet chunks. Without those numbers it is hard to know whether the approximation error stays acceptable across chunk sizes or hidden-state distributions, which directly affects whether the reported accuracy preservation follows from the construction. The experimental description also omits baselines and exact error metrics, so the speedup and accuracy claims cannot be weighed yet.

This is for people already working on efficient linear-attention inference, especially on NPUs or quantized models. A reader who needs a drop-in MatMul replacement for matrix inversion in this setting could extract the algorithmic pattern and the chunk-adaptation rule.

It deserves peer review. The core idea is a legitimate incremental tailoring with a clear hardware target; a referee can check the missing diagnostics and the experimental controls.

Referee Report

2 major / 1 minor

Summary. The paper proposes a MatMul-only approximation for inverting strictly lower-triangular matrices arising in chunk-wise Gated DeltaNet linear attention. It uses a truncated Neumann series plus structural masking and parallel residual correction to remove sequential dependencies, extends the approach to low-bit INT by controlling dynamic-range growth from matrix powers, and adapts approximation order and residual step size to chunk size. Experiments on Qwen3.5-family models are reported to yield up to 5× kernel speedup and 20% decode-layer overhead reduction while preserving accuracy in both FP and quantized inference.

Significance. If the approximation accuracy holds under the stated conditions, the method supplies a hardware-friendly, parallelizable alternative to forward substitution for linear attention on NPUs and quantized accelerators. The explicit handling of quantization-induced range expansion and the chunk-size adaptation are practical strengths that could aid scalable long-context deployment.

major comments (2)

[Abstract] Abstract (motivation paragraph): The truncation-plus-masking construction is justified by the claims of 'rapid growth of Neumann-series terms' and 'diagonal concentration of the inverse matrix,' yet no quantitative diagnostics (measured ||A^k|| decay rates, spectral-radius bounds, or off-diagonal mass fractions) are supplied for the actual strictly lower-triangular matrices produced by Gated DeltaNet chunks. These properties are load-bearing for the claim that truncation error remains negligible.
[Experiments] Experiments (abstract claim): The reported 'up to 5× kernel-level speedup' and 'accuracy preservation' lack explicit baselines (exact forward-substitution timings, relative inversion error metrics, or chunk-size exclusion criteria), so the central empirical result cannot be assessed for robustness across the tested Qwen3.5 variants.

minor comments (1)

The abstract refers to 'Qwen3.5-family models' without naming exact sizes or layer counts; adding this detail would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested evidence and baselines.

Referee: [Abstract] Abstract (motivation paragraph): The truncation-plus-masking construction is justified by the claims of 'rapid growth of Neumann-series terms' and 'diagonal concentration of the inverse matrix,' yet no quantitative diagnostics (measured ||A^k|| decay rates, spectral-radius bounds, or off-diagonal mass fractions) are supplied for the actual strictly lower-triangular matrices produced by Gated DeltaNet chunks. These properties are load-bearing for the claim that truncation error remains negligible.

Authors: We agree that explicit quantitative diagnostics on the Gated DeltaNet matrices are needed to support the truncation claims. Although the full manuscript motivates the approach via general properties of strictly lower-triangular matrices, we will add a new analysis subsection (or appendix) with measured ||A^k|| decay rates, spectral-radius bounds, and off-diagonal mass fractions computed directly on representative chunks from the Qwen3.5 models. This will provide the requested load-bearing evidence. revision: yes
Referee: [Experiments] Experiments (abstract claim): The reported 'up to 5× kernel-level speedup' and 'accuracy preservation' lack explicit baselines (exact forward-substitution timings, relative inversion error metrics, or chunk-size exclusion criteria), so the central empirical result cannot be assessed for robustness across the tested Qwen3.5 variants.

Authors: We concur that explicit baselines are required for proper assessment. The current experiments compare against a reference implementation, but we will expand the experiments section to report: direct wall-clock timings versus exact forward substitution, relative inversion error (e.g., normalized Frobenius distance to the exact inverse), chunk-size exclusion criteria, and per-variant results across all tested Qwen3.5 models. These additions will enable robustness evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity; algorithmic proposal validated on external models

The paper presents a MatMul-based approximation using truncated Neumann expansion, masking, and residual correction for strictly lower-triangular matrices in chunk-wise Gated DeltaNet. The motivation cites rapid Neumann-term growth and diagonal concentration, but these are treated as empirical properties verified by accuracy preservation on Qwen3.5-family models under FP and low-bit inference. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction. Experiments supply independent external benchmarks, so the work is self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard matrix-analysis assumptions about series convergence and matrix structure plus two tunable parameters chosen per chunk size; no new entities are postulated.

free parameters (2)

approximation order
Chosen and adapted to chunk size to balance cost and accuracy.
residual correction step size
Adapted to chunk size.

axioms (2)

domain assumption: Neumann series terms grow rapidly for the matrices in question
Stated motivation for truncation in abstract.
domain assumption: Inverse matrix is sufficiently diagonally concentrated
Stated motivation for truncation in abstract.

Reference graph

Works this paper leans on

54 extracted references · 10 canonical work pages · 3 internal anchors

[1] P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000). 2000.
[2] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
[3] M. J. Kearns.
[4] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2000.
[6] Suppressed for Anonymity.
[7] A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
[8] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. 2017.
[10] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. 2020.
[11] Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V. Le. Transformer Quality in Linear Time. 2022.
[12] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. The Thirteenth International Conference on Learning Representations. 2025.
[14] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated Linear Attention Transformers with Hardware-Efficient Training. 2024.
[17] Matrix Computations. 2013.
[18] FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism.
[19] Gemma 4: Frontier-Level Open Models. 2026.
[20] Qwen3 Technical Report. 2025.
[21] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 5th International Conference on Learning Representations. 2017.
[26] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 9th International Conference on Learning Representations. 2021.
[27] RealWorldQA: A Real-World Visual Question Answering Benchmark. 2024.
[28] FlashQLA: Flash Qwen Linear Attention. 2026.
[29] gdn-tri-inverse: Evaluation of Gated Delta Networks with Triangular Matrix Inversion.
[30] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. 2024.
[34] Albert Gu, Karan Goel, and Christopher R. Efficiently Modeling Long Sequences with Structured State Spaces. 2022.
[35] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. J. Mach. Learn. Res. 2020.
[38] Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Inte... doi:10.1609/aaai.v34i05.6239
[39] Clark, C., Lee, K., Chang, M., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ... doi:10.18653/v1/n19-1300
[40] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457
[41] Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University Press, 2013.
[42] Google DeepMind. Gemma 4: Frontier-level open models. https://deepmind.google/models/gemma/gemma-4/, 2026. Model card and technical documentation.
[43] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023. doi:10.48550/ARXIV.2312.00752. URL https://doi.org/10.48550/arXiv.2312.00752
[44] Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC
[45] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ
[46] Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality in linear time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, Proceedings of Machine Learning Research, pp. 9099-9117. PMLR, 2022. URL https://p...
[47] Huawei CSL. gdn-tri-inverse: Evaluation of gated delta networks with triangular matrix inversion. https://github.com/huawei-csl/gdn-tri-inverse, 2026.
[48] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pp. 5156-5165. PMLR, 2020. URL http://proceedings.mlr.press...
[49] Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 26286-26296. IEEE, 2024. doi:10.1109/CVPR52733.2024.02484. URL https://doi.org/10.1109/CVPR52733.2024.02484
[50] Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Byj72udxe
[51] Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., Song, G., Tan, K., Utpala, S., Wilce, N., Wind, J. S., Wu, T., Wuttke, D., and Zhou-Zheng, C. RWKV-7 "goose" with expressive dynamic state evolution. CoRR, abs/2503.14456, 2025. doi:10.48550/ARXIV.2503.14456. URL https://doi.org/10.48550/arXiv.2503.14456
[52] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
[53] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67, 2020. URL https://jmlr.org/papers/v21/20-074.html
[54] Siddegowda, S., Fournarakis, M., Nagel, M., Blankevoort, T., Patel, C., and Khobare, A. Neural network quantization with AI model efficiency toolkit (AIMET). CoRR, abs/2201.08442, 2022. URL https://arxiv.org/abs/2201.08442
[55] Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. CoRR, abs/2307.08621, 2023. doi:10.48550/ARXIV.2307.08621. URL https://doi.org/10.48550/arXiv.2307.08621
[56] Team, Q. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
[57] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processi...
[58] xAI. Realworldqa: A real-world visual question answering benchmark. https://huggingface.co/datasets/xai-org/RealworldQA, 2024. Released with Grok-1.5 Vision.
[59] Yang, S. and Zhang, Y. FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/fla-org/flash-linear-attention
[60] Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. Gated linear attention transformers with hardware-efficient training. In Salakhutdinov, R., Kolter, Z., Heller, K. A., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, Proceedings of ...
[61] Yang, S., Kautz, J., and Hatamizadeh, A. Gated delta networks: Improving mamba2 with delta rule. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025a. URL https://openreview.net/forum?id=r8H7xhYPwz
[62] Yang, S., Shen, Y., Wen, K., Tan, S., Mishra, M., Ren, L., Panda, R., and Kim, Y. Path attention: Position encoding via accumulating householder transformations. CoRR, abs/2505.16381, 2025b. doi:10.48550/ARXIV.2505.16381. URL https://doi.org/10.48550/arXiv.2505.16381
[63] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp. 4791-4800. As... doi:10.18653/v1/p19-1472
[64] Zhang, C., Lin, X., Jiang, H., Wang, Z., Li, X., Cao, Y., Zhuang, B., Men, R., Zhang, J., Zheng, B., Lin, J., Liu, D., and Zhou, J. Flashqla: Flash qwen linear attention. https://github.com/QwenLM/FlashQLA, 2026.
[65] Zhang, Y., Lin, Z., Yao, X., Hu, J., Meng, F., Liu, C., Men, X., Yang, S., Li, Z., Li, W., Lu, E., Liu, W., Chen, Y., Xu, W., Yu, L., Wang, Y., Fan, Y., Zhong, L., Yuan, E., Zhang, D., Zhang, Y., Liu, T. Y., Wang, H., Fang, S., He, W., Liu, S., Li, Y., Su, J., Qiu, J., Pang, B., Yan, J., Jiang, Z., Huang, W., Yin, B., You, J., Wei, C., Wang, Z., Hong, C.,... doi:10.48550/arxiv.2510.26692
[66] Zhong, S., Xu, M., Ao, T., and Shi, G. Understanding transformer from the perspective of associative memory. CoRR, abs/2505.19488, 2025. doi:10.48550/ARXIV.2505.19488. URL https://doi.org/10.48550/arXiv.2505.19488

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[2] [2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[3] [3]

M. J. Kearns , title =

[4] [4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[5] [5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[6] [6]

Suppressed for Anonymity , author=

[7] [7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[8] [8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[23] [30]

Yang, S., Wang, B., Zhang, Y., Shen, Y., and Kim, Y. Parallelizing linear transformers with the delta rule over sequence length. 2024.

[26] [38]

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: Reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Inte... doi:10.1609/aaai.v34i05.6239

[27] [39]

Clark, C., Lee, K., Chang, M., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ... doi:10.18653/v1/n19-1300

[28] [40]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457

[29] [41]

Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University Press, 2013.

[30] [42]

Google DeepMind. Gemma 4: Frontier-level open models. https://deepmind.google/models/gemma/gemma-4/, 2026. Model card and technical documentation.

[31] [43]

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023. doi:10.48550/ARXIV.2312.00752. URL https://doi.org/10.48550/arXiv.2312.00752

[32] [44]

Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC

[33] [45]

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

[34] [46]

Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality in linear time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, Proceedings of Machine Learning Research, pp. 9099–9117. PMLR, 2022. URL https://p...

[35] [47]

Huawei CSL. gdn-tri-inverse: Evaluation of gated delta networks with triangular matrix inversion. https://github.com/huawei-csl/gdn-tri-inverse, 2026.

[36] [48]

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pp. 5156–5165. PMLR, 2020. URL http://proceedings.mlr.press...

[37] [49]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 26286–26296. IEEE, 2024. doi:10.1109/CVPR52733.2024.02484. URL https://doi.org/10.1109/CVPR52733.2024.02484

[38] [50]

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Byj72udxe

[39] [51]

Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., Song, G., Tan, K., Utpala, S., Wilce, N., Wind, J. S., Wu, T., Wuttke, D., and Zhou-Zheng, C. RWKV-7 "goose" with expressive dynamic state evolution. CoRR, abs/2503.14456, 2025. doi:10.48550/ARXIV.2503.14456. URL https://doi.org/10.48550/arXiv.2503.14456

[40] [52]

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[41] [53]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL https://jmlr.org/papers/v21/20-074.html

[42] [54]

Siddegowda, S., Fournarakis, M., Nagel, M., Blankevoort, T., Patel, C., and Khobare, A. Neural network quantization with AI model efficiency toolkit (AIMET). CoRR, abs/2201.08442, 2022. URL https://arxiv.org/abs/2201.08442

[43] [55]

Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models. CoRR, abs/2307.08621, 2023. doi:10.48550/ARXIV.2307.08621. URL https://doi.org/10.48550/arXiv.2307.08621

[44] [56]

Team, Q. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

[45] [57]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processi...

[46] [58]

xAI. RealWorldQA: A real-world visual question answering benchmark. https://huggingface.co/datasets/xai-org/RealworldQA, 2024. Released with Grok-1.5 Vision.

[47] [59]

Yang, S. and Zhang, Y. FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/fla-org/flash-linear-attention

[48] [60]

Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. Gated linear attention transformers with hardware-efficient training. In Salakhutdinov, R., Kolter, Z., Heller, K. A., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, Proceedings of ...

[49] [61]

Yang, S., Kautz, J., and Hatamizadeh, A. Gated delta networks: Improving Mamba2 with delta rule. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025a. URL https://openreview.net/forum?id=r8H7xhYPwz

[50] [62]

Yang, S., Shen, Y., Wen, K., Tan, S., Mishra, M., Ren, L., Panda, R., and Kim, Y. Path attention: Position encoding via accumulating householder transformations. CoRR, abs/2505.16381, 2025b. doi:10.48550/ARXIV.2505.16381. URL https://doi.org/10.48550/arXiv.2505.16381

[51] [63]

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pp. 4791–4800. As... doi:10.18653/v1/p19-1472

[52] [64]

Zhang, C., Lin, X., Jiang, H., Wang, Z., Li, X., Cao, Y., Zhuang, B., Men, R., Zhang, J., Zheng, B., Lin, J., Liu, D., and Zhou, J. FlashQLA: Flash Qwen linear attention. https://github.com/QwenLM/FlashQLA, 2026.

[53] [65]

Zhang, Y., Lin, Z., Yao, X., Hu, J., Meng, F., Liu, C., Men, X., Yang, S., Li, Z., Li, W., Lu, E., Liu, W., Chen, Y., Xu, W., Yu, L., Wang, Y., Fan, Y., Zhong, L., Yuan, E., Zhang, D., Zhang, Y., Liu, T. Y., Wang, H., Fang, S., He, W., Liu, S., Li, Y., Su, J., Qiu, J., Pang, B., Yan, J., Jiang, Z., Huang, W., Yin, B., You, J., Wei, C., Wang, Z., Hong, C.,... doi:10.48550/arxiv.2510.26692

[54] [66]

Zhong, S., Xu, M., Ao, T., and Shi, G. Understanding transformer from the perspective of associative memory. CoRR, abs/2505.19488, 2025. doi:10.48550/ARXIV.2505.19488. URL https://doi.org/10.48550/arXiv.2505.19488