pith. machine review for the scientific record.

arxiv: 2605.08587 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 Lean theorem links

Kaczmarz Linear Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords linear attention · delta rule · Kaczmarz method · gated recurrent models · long context modeling · online regression · state space models

The pith

Kaczmarz-derived normalization of the delta-rule update step size in Gated DeltaNet improves perplexity, long-context stability, and decoding efficiency in linear attention models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear recurrent models compress context into a fixed state to avoid quadratic attention costs, but maintaining that state requires careful rules for forgetting and updating. Gated DeltaNet learns a coefficient to balance these, yet the paper shows this coefficient can be derived exactly from the underlying online regression objective by applying the Kaczmarz projection method. The resulting Kaczmarz Linear Attention uses the key-norm-normalized step size β_t = η_t / (‖k_t‖₂² + ε) as a direct replacement. This single change yields lower validation perplexity at the 0.4B scale, perfect scores on needle-in-a-haystack retrieval, and faster decoding without altering the state or algorithm structure.

Core claim

The authors derive a dynamic step size for residual writes in the delta rule from the Kaczmarz projection onto the solution of the online regression problem. This produces the normalized coefficient that replaces the learned one in GDN, preserving all other components of the model. Empirical results confirm gains in language modeling perplexity and associative recall tasks.
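A minimal reconstruction of that derivation from the abstract's description alone; the state matrix S ∈ R^{d_k×d_v} and the residual notation are our assumptions, not quoted from the paper.

% Per-token online regression objective (assumed form):
%   L_t(S) = ½ ‖Sᵀk_t − v_t‖₂²
% The Kaczmarz method projects the current state onto the hyperplane
% {S : Sᵀk_t = v_t}, which fixes the step size at 1/‖k_t‖₂²:
\[
  S_t = S_{t-1} + \frac{k_t \left(v_t - S_{t-1}^{\top} k_t\right)^{\top}}{\lVert k_t \rVert_2^2}
\]
% Scaling the write by the gate η_t and regularizing the denominator
% recovers the paper's coefficient β_t = η_t / (‖k_t‖₂² + ε).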

What carries the argument

The Kaczmarz-normalized step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates in the gated delta-rule state transition, sketched below.
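A minimal NumPy sketch of one recurrent step under assumed conventions; the function name, state shape, and scalar gate semantics are ours, and the paper's actual implementation is a chunkwise-parallel kernel rather than a token loop.

import numpy as np

def kla_step(S, k, v, eta, g, eps=1e-6):
    """One gated delta-rule step with the Kaczmarz-normalized step size.
    Assumed shapes: S is the (d_k, d_v) state, k a (d_k,) key,
    v a (d_v,) value; eta and g are scalar write/forget gates in [0, 1]."""
    beta = eta / (k @ k + eps)               # Kaczmarz-normalized step size β_t
    S = g * S                                # gated state decay (forget)
    residual = v - S.T @ k                   # prediction error for this token
    return S + beta * np.outer(k, residual)  # delta-rule residual write

The readout would then be the usual linear-attention form o_t = S_tᵀ q_t; the only departure from GDN in this sketch is that beta is computed from the key norm instead of being a learned scalar.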

If this is right

  • Achieves the lowest validation perplexity of 8.09 among linear-time baselines at 0.4B parameters with 1B tokens.
  • Maintains stability and performance up to 65K token contexts.
  • Attains 100% accuracy on single-needle-in-a-haystack retrieval tasks.
  • Improves 8x multi-query associative recall by 7.03 points over GDN.
  • Delivers 2.1x higher decode throughput at 32K context length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The success of this derivation indicates that other empirically tuned coefficients in linear recurrent models may benefit from similar objective-based analysis.
  • This approach could extend to other projection methods or regression objectives in sequence modeling.
  • Better update rules may allow linear models to close the gap with quadratic attention on even longer contexts.
  • The efficiency gains suggest practical deployment advantages for inference on long sequences.

Load-bearing premise

The specific form of the online-regression objective used in GDN directly produces the Kaczmarz-normalized step size without requiring additional assumptions about the distribution of keys or interactions with learned gates.

What would settle it

If replacing the Kaczmarz coefficient with the original learned coefficient in otherwise identical models eliminates the performance advantage on validation perplexity and retrieval tasks, the coefficient is confirmed as the source of the gains; if the advantage survives the swap, the benefit claimed for the derivation would be refuted.
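Rendered as code under the same assumed conventions as the sketch above, the settling experiment is a one-line substitution; learned_beta here is a hypothetical stand-in for GDN's learned coefficient.

import numpy as np

def write_coefficient(k, eta, learned_beta, use_kaczmarz=True, eps=1e-6):
    """Toggle for the settling experiment: every other component of the
    model stays fixed, and only the source of the write coefficient
    changes. learned_beta is a hypothetical stand-in for GDN's scalar."""
    if use_kaczmarz:
        return eta / (np.dot(k, k) + eps)  # KLA: derived from the key norm
    return learned_beta                    # GDN: learned coefficient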

Figures

Figures reproduced from arXiv: 2605.08587 by Jiaxuan Zou, Ruifeng Ren, Yong Liu.

Figure 1. Validation perplexity curves during pretraining; KLA converges to a lower final validation perplexity. [image not reproduced]
Figure 2. Synthetic task training convergence; Table 4 reports length extrapolation; insets zoom into […]. [image not reproduced]
Figure 3. Efficiency comparison with batch size 1; KLA and GDN have nearly identical prefill […]. [image not reproduced]
Original abstract

Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Kaczmarz Linear Attention (KLA) as a one-scalar modification to Gated DeltaNet (GDN) for linear-time long-context modeling. It revisits the online-regression objective underlying GDN and, inspired by the Kaczmarz projection, derives the key-norm-normalized step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates. The modification is claimed to preserve the recurrent state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA reports the lowest validation perplexity among linear baselines (8.09 vs. 8.50 for GDN), stability to 65K tokens, 100% single-needle retrieval, +7.03 points on 8x multi-query associative recall, and 2.1x higher decode throughput at 32K context.

Significance. If the derivation is rigorous and the gains are robust, this would constitute a useful contribution to delta-rule linear recurrent models by identifying key-norm normalization as a first-order design choice that improves accuracy, extrapolation, and efficiency without altering state dimensionality or hardware kernels. The preservation of the existing chunkwise parallel algorithm and the minimal change (one scalar) are clear practical strengths that facilitate adoption. The empirical results at the 0.4B scale provide concrete evidence of benefit on both language modeling and controlled retrieval tasks.

major comments (2)
  1. [§3] §3 (Derivation of β_t): the manuscript states that β_t = η_t / (‖k_t‖₂² + ε) is derived from the GDN online-regression objective via the Kaczmarz projection, yet provides no explicit expansion showing how the projection interacts with the learned forget gates g_t and write gates. Without this step-by-step accounting, it remains unclear whether the normalization emerges exactly after gating or requires additional assumptions on key distributions; this directly affects whether the 0.41 perplexity drop is theoretically motivated or an empirical tuning result.
  2. [§4] §4 (Experimental results): the central performance claims (validation perplexity 8.09 vs. 8.50, stability to 65K tokens, task improvements) are reported as single-point numbers with no error bars, no multiple random seeds, and no statistical significance tests. There is also no ablation on the free parameter ε or sensitivity analysis for η_t, both of which appear in the step-size formula; these omissions are load-bearing for the claim that KLA is reliably superior to GDN.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'one-scalar modification' is used without immediately clarifying that the scalar in question is the step-size coefficient itself.
  2. [§3] Notation: the symbols η_t and ε are introduced in the step-size formula but their initialization or scheduling is not summarized in the main text before the experimental section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We provide point-by-point responses below and will incorporate the suggested clarifications and additional analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Derivation of β_t): the manuscript states that β_t = η_t / (‖k_t‖₂² + ε) is derived from the GDN online-regression objective via the Kaczmarz projection, yet provides no explicit expansion showing how the projection interacts with the learned forget gates g_t and write gates. Without this step-by-step accounting, it remains unclear whether the normalization emerges exactly after gating or requires additional assumptions on key distributions; this directly affects whether the 0.41 perplexity drop is theoretically motivated or an empirical tuning result.

    Authors: We agree that the derivation in §3 would benefit from a more explicit step-by-step expansion. The Kaczmarz projection is applied to the residual update term after the forget gate g_t has been applied to the previous state and before the write gate modulates the update. Starting from the online least-squares objective, the projection onto the key direction yields the normalization by ‖k_t‖₂² directly on the gated residual; a rendering of this update order appears after these responses. We will expand this derivation in the revised §3 to show the interaction with g_t and the write gates without additional distributional assumptions. revision: yes

  2. Referee: [§4] §4 (Experimental results): the central performance claims (validation perplexity 8.09 vs. 8.50, stability to 65K tokens, task improvements) are reported as single-point numbers with no error bars, no multiple random seeds, and no statistical significance tests. There is also no ablation on the free parameter ε or sensitivity analysis for η_t, both of which appear in the step-size formula; these omissions are load-bearing for the claim that KLA is reliably superior to GDN.

    Authors: We acknowledge the limitation of reporting single-run results without error bars or multiple seeds. In the revision, we will rerun the experiments with at least three random seeds and report means with standard deviations for perplexity and task metrics. Additionally, we will include an ablation study on ε (e.g., values from 1e-6 to 1e-2) and sensitivity analysis for η_t to confirm the robustness of the gains. revision: yes
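Read literally, the order of operations the authors describe in their first response is decay first, then a Kaczmarz-normalized write against the gated state; a hedged rendering in our notation, not text from the revised manuscript:

\[
  \tilde{S}_t = g_t \, S_{t-1}, \qquad
  S_t = \tilde{S}_t + \frac{\eta_t}{\lVert k_t \rVert_2^2 + \epsilon} \,
        k_t \left(v_t - \tilde{S}_t^{\top} k_t\right)^{\top}
\]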

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The paper revisits the online-regression objective of GDN and states that it derives the normalized step-size form β_t = η_t / (‖k_t‖₂² + ε) via Kaczmarz projection. No quoted equation or section reduces this form to a tautological re-expression of the input objective, a fitted parameter renamed as prediction, or a self-citation chain. The parameters η_t and ε are presented as part of the resulting update rule rather than hidden inputs that force the output. The central empirical claims (perplexity, retrieval accuracy) rest on experimental comparison rather than on the derivation alone. The derivation chain is therefore grounded in external benchmarks rather than referring back to itself, and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on (1) the online-regression objective that justifies delta-rule updates in GDN and (2) the algebraic step that imports the Kaczmarz projection to produce the norm-normalized coefficient. No new entities are postulated.

free parameters (2)
  • η_t
    Scalar multiplier in the derived step size; its schedule or learned status is not specified in the abstract.
  • ε
    Additive constant in the denominator to avoid division by zero; its value is not reported.
axioms (1)
  • domain assumption: The delta-rule update in GDN is exactly the online solution to a regression objective.
    Invoked when the authors say they 'revisit the online-regression objective underlying GDN'.

pith-pipeline@v0.9.0 · 5623 in / 1445 out tokens · 50288 ms · 2026-05-12T01:11:40.957978+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates. ... The exact line-search coefficient is therefore τ* = 1/‖k_t‖₂².

  • Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Proposition 1 (Exact proximal form). ... the relaxed update in (3) is the exact minimizer of min_S ½‖S − S̃_{t−1}‖²_F + (μ_t/2)‖Sᵀk_t − v_t‖₂².
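As a check on that echo, here is our worked solution of the quoted proximal problem; the derivation is ours, not the paper's text. Writing the minimizer as a rank-one correction S = S̃_{t−1} + k_t cᵀ and solving the first-order condition gives

\[
  S^{\star} = \tilde{S}_{t-1} + \frac{\mu_t}{1 + \mu_t \lVert k_t \rVert_2^2} \,
              k_t \left(v_t - \tilde{S}_{t-1}^{\top} k_t\right)^{\top},
\]

and as μ_t → ∞ the coefficient tends to 1/‖k_t‖₂², the exact Kaczmarz line-search step quoted in the first echo above.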

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
