pith. machine review for the scientific record.

arxiv: 2605.08587 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Kaczmarz Linear Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords linear attention · delta rule · Kaczmarz method · gated recurrent models · long context modeling · online regression · state space models

The pith

Kaczmarz-derived normalization of the delta-rule update step size in Gated DeltaNet improves perplexity, long-context stability, and decoding efficiency in linear attention models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear recurrent models compress context into a fixed state to avoid quadratic attention costs, but maintaining that state requires careful rules for forgetting and updating. Gated DeltaNet learns a coefficient to balance these, yet the paper shows this coefficient can be derived exactly from the underlying online regression objective by applying the Kaczmarz projection method. The resulting Kaczmarz Linear Attention uses the key-norm-normalized step size β_t = η_t / (‖k_t‖₂² + ε) as a direct replacement. This single change yields lower validation perplexity at the 0.4B scale, perfect scores on needle-in-a-haystack retrieval, and faster decoding without altering the state or algorithm structure.

Core claim

The authors derive a dynamic step size for residual writes in the delta rule from the Kaczmarz projection onto the solution of the online regression problem. This produces the normalized coefficient that replaces the learned one in GDN, preserving all other components of the model. Empirical results confirm gains in language modeling perplexity and associative recall tasks.
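A minimal reconstruction of that derivation from the abstract's description alone; the state matrix S ∈ R^{d_k×d_v} and the residual notation are our assumptions, not quoted from the paper.

% Per-token online regression objective (assumed form):
%   L_t(S) = ½ ‖Sᵀk_t − v_t‖₂²
% The Kaczmarz method projects the current state onto the hyperplane
% {S : Sᵀk_t = v_t}, which fixes the step size at 1/‖k_t‖₂²:
\[
  S_t = S_{t-1} + \frac{k_t \left(v_t - S_{t-1}^{\top} k_t\right)^{\top}}{\lVert k_t \rVert_2^2}
\]
% Scaling the write by the gate η_t and regularizing the denominator
% recovers the paper's coefficient β_t = η_t / (‖k_t‖₂² + ε).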

What carries the argument

The Kaczmarz-normalized step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates in the gated delta-rule state transition, sketched below.
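A minimal NumPy sketch of one recurrent step under assumed conventions; the function name, state shape, and scalar gate semantics are ours, and the paper's actual implementation is a chunkwise-parallel kernel rather than a token loop.

import numpy as np

def kla_step(S, k, v, eta, g, eps=1e-6):
    """One gated delta-rule step with the Kaczmarz-normalized step size.
    Assumed shapes: S is the (d_k, d_v) state, k a (d_k,) key,
    v a (d_v,) value; eta and g are scalar write/forget gates in [0, 1]."""
    beta = eta / (k @ k + eps)               # Kaczmarz-normalized step size β_t
    S = g * S                                # gated state decay (forget)
    residual = v - S.T @ k                   # prediction error for this token
    return S + beta * np.outer(k, residual)  # delta-rule residual write

The readout would then be the usual linear-attention form o_t = S_tᵀ q_t; the only departure from GDN in this sketch is that beta is computed from the key norm instead of being a learned scalar.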

If this is right

  • Achieves the lowest validation perplexity of 8.09 among linear-time baselines at 0.4B parameters with 1B tokens.
  • Maintains stability and performance up to 65K token contexts.
  • Attains 100% accuracy on single-needle-in-a-haystack retrieval tasks.
  • Improves 8x multi-query associative recall by 7.03 points over GDN.
  • Delivers 2.1x higher decode throughput at 32K context length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The success of this derivation indicates that other empirically tuned coefficients in linear recurrent models may benefit from similar objective-based analysis.
  • This approach could extend to other projection methods or regression objectives in sequence modeling.
  • Better update rules may allow linear models to close the gap with quadratic attention on even longer contexts.
  • The efficiency gains suggest practical deployment advantages for inference on long sequences.

Load-bearing premise

The specific form of the online-regression objective used in GDN directly produces the Kaczmarz-normalized step size without requiring additional assumptions about the distribution of keys or interactions with learned gates.

What would settle it

If replacing the Kaczmarz coefficient with the original learned coefficient in otherwise identical models eliminates the performance advantage on validation perplexity and retrieval tasks, the coefficient is confirmed as the source of the gains; if the advantage survives the swap, the benefit claimed for the derivation would be refuted.
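Rendered as code under the same assumed conventions as the sketch above, the settling experiment is a one-line substitution; learned_beta here is a hypothetical stand-in for GDN's learned coefficient.

import numpy as np

def write_coefficient(k, eta, learned_beta, use_kaczmarz=True, eps=1e-6):
    """Toggle for the settling experiment: every other component of the
    model stays fixed, and only the source of the write coefficient
    changes. learned_beta is a hypothetical stand-in for GDN's scalar."""
    if use_kaczmarz:
        return eta / (np.dot(k, k) + eps)  # KLA: derived from the key norm
    return learned_beta                    # GDN: learned coefficient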

Figures

Figures reproduced from arXiv: 2605.08587 by Jiaxuan Zou, Ruifeng Ren, Yong Liu.

Figure 1. Validation perplexity curves during pretraining; KLA converges to a lower final validation perplexity. [image not reproduced]
Figure 2. Synthetic task training convergence; Table 4 reports length extrapolation; insets zoom into […]. [image not reproduced]
Figure 3. Efficiency comparison with batch size 1; KLA and GDN have nearly identical prefill […]. [image not reproduced]
Original abstract

Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Kaczmarz Linear Attention (KLA) as a one-scalar modification to Gated DeltaNet (GDN) for linear-time long-context modeling. It revisits the online-regression objective underlying GDN and, inspired by the Kaczmarz projection, derives the key-norm-normalized step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates. The modification is claimed to preserve the recurrent state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA reports the lowest validation perplexity among linear baselines (8.09 vs. 8.50 for GDN), stability to 65K tokens, 100% single-needle retrieval, +7.03 points on 8x multi-query associative recall, and 2.1x higher decode throughput at 32K context.

Significance. If the derivation is rigorous and the gains are robust, this would constitute a useful contribution to delta-rule linear recurrent models by identifying key-norm normalization as a first-order design choice that improves accuracy, extrapolation, and efficiency without altering state dimensionality or hardware kernels. The preservation of the existing chunkwise parallel algorithm and the minimal change (one scalar) are clear practical strengths that facilitate adoption. The empirical results at the 0.4B scale provide concrete evidence of benefit on both language modeling and controlled retrieval tasks.

major comments (2)
  1. [§3] §3 (Derivation of β_t): the manuscript states that β_t = η_t / (‖k_t‖₂² + ε) is derived from the GDN online-regression objective via the Kaczmarz projection, yet provides no explicit expansion showing how the projection interacts with the learned forget gates g_t and write gates. Without this step-by-step accounting, it remains unclear whether the normalization emerges exactly after gating or requires additional assumptions on key distributions; this directly affects whether the 0.41 perplexity drop is theoretically motivated or an empirical tuning result.
  2. [§4] §4 (Experimental results): the central performance claims (validation perplexity 8.09 vs. 8.50, stability to 65K tokens, task improvements) are reported as single-point numbers with no error bars, no multiple random seeds, and no statistical significance tests. There is also no ablation on the free parameter ε or sensitivity analysis for η_t, both of which appear in the step-size formula; these omissions are load-bearing for the claim that KLA is reliably superior to GDN.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'one-scalar modification' is used without immediately clarifying that the scalar in question is the step-size coefficient itself.
  2. [§3] Notation: the symbols η_t and ε are introduced in the step-size formula but their initialization or scheduling is not summarized in the main text before the experimental section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We provide point-by-point responses below and will incorporate the suggested clarifications and additional analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Derivation of β_t): the manuscript states that β_t = η_t / (‖k_t‖₂² + ε) is derived from the GDN online-regression objective via the Kaczmarz projection, yet provides no explicit expansion showing how the projection interacts with the learned forget gates g_t and write gates. Without this step-by-step accounting, it remains unclear whether the normalization emerges exactly after gating or requires additional assumptions on key distributions; this directly affects whether the 0.41 perplexity drop is theoretically motivated or an empirical tuning result.

    Authors: We agree that the derivation in §3 would benefit from a more explicit step-by-step expansion. The Kaczmarz projection is applied to the residual update term after the forget gate g_t has been applied to the previous state and before the write gate modulates the update. Starting from the online least-squares objective, the projection onto the key direction yields the normalization by ‖k_t‖₂² directly on the gated residual; a rendering of this update order appears after these responses. We will expand this derivation in the revised §3 to show the interaction with g_t and the write gates without additional distributional assumptions. revision: yes

  2. Referee: [§4] §4 (Experimental results): the central performance claims (validation perplexity 8.09 vs. 8.50, stability to 65K tokens, task improvements) are reported as single-point numbers with no error bars, no multiple random seeds, and no statistical significance tests. There is also no ablation on the free parameter ε or sensitivity analysis for η_t, both of which appear in the step-size formula; these omissions are load-bearing for the claim that KLA is reliably superior to GDN.

    Authors: We acknowledge the limitation of reporting single-run results without error bars or multiple seeds. In the revision, we will rerun the experiments with at least three random seeds and report means with standard deviations for perplexity and task metrics. Additionally, we will include an ablation study on ε (e.g., values from 1e-6 to 1e-2) and sensitivity analysis for η_t to confirm the robustness of the gains. revision: yes
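Read literally, the order of operations the authors describe in their first response is decay first, then a Kaczmarz-normalized write against the gated state; a hedged rendering in our notation, not text from the revised manuscript:

\[
  \tilde{S}_t = g_t \, S_{t-1}, \qquad
  S_t = \tilde{S}_t + \frac{\eta_t}{\lVert k_t \rVert_2^2 + \epsilon} \,
        k_t \left(v_t - \tilde{S}_t^{\top} k_t\right)^{\top}
\]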

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper revisits the online-regression objective of GDN and states that it derives the normalized step-size form β_t = η_t / (‖k_t‖₂² + ε) via Kaczmarz projection. No quoted equation or section reduces this form to a tautological re-expression of the input objective, a fitted parameter renamed as prediction, or a self-citation chain. The parameters η_t and ε are presented as part of the resulting update rule rather than hidden inputs that force the output. The central empirical claims (perplexity, retrieval accuracy) rest on experimental comparison rather than on the derivation alone. The derivation chain is therefore grounded in external benchmarks rather than referring back to itself, and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on (1) the online-regression objective that justifies delta-rule updates in GDN and (2) the algebraic step that imports the Kaczmarz projection to produce the norm-normalized coefficient. No new entities are postulated.

free parameters (2)
  • η_t
    Scalar multiplier in the derived step size; its schedule or learned status is not specified in the abstract.
  • ε
    Additive constant in the denominator to avoid division by zero; its value is not reported.
axioms (1)
  • domain assumption: The delta-rule update in GDN is exactly the online solution to a regression objective.
    Invoked when the authors say they 'revisit the online-regression objective underlying GDN'.

pith-pipeline@v0.9.0 · 5623 in / 1445 out tokens · 50288 ms · 2026-05-12T01:11:40.957978+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates. ... The exact line-search coefficient is therefore τ* = 1/‖k_t‖₂².

  • Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Proposition 1 (Exact proximal form). ... the relaxed update in (3) is the exact minimizer of min_S ½‖S − S̃_{t−1}‖²_F + (μ_t/2)‖Sᵀk_t − v_t‖₂².
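As a check on that echo, here is our worked solution of the quoted proximal problem; the derivation is ours, not the paper's text. Writing the minimizer as a rank-one correction S = S̃_{t−1} + k_t cᵀ and solving the first-order condition gives

\[
  S^{\star} = \tilde{S}_{t-1} + \frac{\mu_t}{1 + \mu_t \lVert k_t \rVert_2^2} \,
              k_t \left(v_t - \tilde{S}_{t-1}^{\top} k_t\right)^{\top},
\]

and as μ_t → ∞ the coefficient tends to 1/‖k_t‖₂², the exact Kaczmarz line-search step quoted in the first echo above.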

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
