Recognition: 2 theorem links · Lean Theorem
Kaczmarz Linear Attention
Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3
The pith
Kaczmarz-derived normalization of the delta-rule update step size in Gated DeltaNet improves perplexity, long-context stability, and decoding efficiency in linear attention models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors derive a dynamic step size for residual writes in the delta rule from the Kaczmarz projection onto the solution set of the current online-regression equation. This yields a key-norm-normalized coefficient that replaces the learned one in GDN while preserving all other components of the model. Empirical results show gains in language modeling perplexity and associative recall.
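For concreteness, a hedged reconstruction of the projection step behind this claim, using only quantities named in the abstract and the quoted line-search coefficient τ* = 1/‖k_t‖₂²; gates are omitted, and the placement of η_t and ε relative to the gates is an assumption rather than the paper's exact derivation:

\[
S_t \;=\; \arg\min_{S:\,S^{\top}k_t = v_t} \tfrac{1}{2}\lVert S - S_{t-1}\rVert_F^2
\;=\; S_{t-1} + \frac{k_t\,(v_t - S_{t-1}^{\top}k_t)^{\top}}{\lVert k_t\rVert_2^2},
\]

so the exact Kaczmarz step size is 1/‖k_t‖₂², and the proposed β_t = η_t / (‖k_t‖₂² + ε) is that step relaxed by η_t and stabilized by ε.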
What carries the argument
The Kaczmarz-normalized step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates in the gated delta rule state transition.
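A minimal code sketch of how that single scalar would slot into a gated delta-rule recurrence, assuming the common fast-weight form of the state update; the function name kla_step, the scalar gate shapes, and the exact placement of the forget gate are assumptions for illustration, not the paper's kernel:

import numpy as np

def kla_step(S, k, v, g, eta, eps=1e-6):
    # One token step of a gated delta rule with the Kaczmarz-normalized
    # step size beta_t = eta_t / (||k_t||_2^2 + eps).
    # S: (d_k, d_v) fast-weight state; k: (d_k,) key; v: (d_v,) value;
    # g: forget gate in [0, 1]; eta: write gate in [0, 1].
    S_decayed = g * S                                # gated decay of the previous state
    beta = eta / (k @ k + eps)                       # the one scalar KLA changes relative to GDN
    residual = v - S_decayed.T @ k                   # delta-rule prediction error
    return S_decayed + beta * np.outer(k, residual)  # rank-1 residual write

With eta = 1 and eps = 0 the write is exactly the Kaczmarz projection of the decayed state onto {S : Sᵀk_t = v_t}.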
If this is right
- Achieves the lowest validation perplexity of 8.09 among linear-time baselines at 0.4B parameters with 1B tokens.
- Maintains stability and performance up to 65K token contexts.
- Attains 100% accuracy on single-needle-in-a-haystack retrieval tasks.
- Improves 8x multi-query associative recall by 7.03 points over GDN.
- Delivers 2.1x higher decode throughput at 32K context length.
Where Pith is reading between the lines
- The success of this derivation indicates that other empirically tuned coefficients in linear recurrent models may benefit from similar objective-based analysis.
- This approach could extend to other projection methods or regression objectives in sequence modeling.
- Better update rules may allow linear models to close the gap with quadratic attention on even longer contexts.
- The efficiency gains suggest practical deployment advantages for inference on long sequences.
Load-bearing premise
The specific form of the online-regression objective used in GDN directly produces the Kaczmarz-normalized step size without requiring additional assumptions about the key distribution or about how the projection interacts with the learned gates.
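The key-scale part of this premise can be illustrated numerically for the ungated update: with β_t = 1/‖k_t‖₂² the write satisfies Sᵀk_t = v_t exactly whatever the key's norm, and ε only shrinks the residual by a factor ε / (‖k_t‖₂² + ε) instead of zeroing it. A small sketch (gates omitted, so it does not speak to the gate-interaction part; not the paper's code):

import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, eps = 64, 32, 1e-6
S = rng.normal(size=(d_k, d_v))

for scale in (0.01, 1.0, 100.0):                  # very different key norms
    k = scale * rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    beta = 1.0 / (k @ k + eps)                    # Kaczmarz step with eta_t = 1
    S_new = S + beta * np.outer(k, v - S.T @ k)   # delta-rule residual write
    # leftover error is (v - S.T @ k) * eps / (||k||^2 + eps); zero when eps = 0
    print(scale, np.max(np.abs(S_new.T @ k - v)))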
What would settle it
A controlled swap settles it: if replacing the Kaczmarz coefficient with the original learned coefficient in otherwise identical models eliminates the performance advantage on validation perplexity and retrieval tasks, the coefficient is doing the work; if the advantage persists, the benefit of the derivation would be refuted.
original abstract
Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Kaczmarz Linear Attention (KLA) as a one-scalar modification to Gated DeltaNet (GDN) for linear-time long-context modeling. It revisits the online-regression objective underlying GDN and, inspired by the Kaczmarz projection, derives the key-norm-normalized step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates. The modification is claimed to preserve the recurrent state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA reports the lowest validation perplexity among linear baselines (8.09 vs. 8.50 for GDN), stability to 65K tokens, 100% single-needle retrieval, +7.03 points on 8x multi-query associative recall, and 2.1x higher decode throughput at 32K context.
Significance. If the derivation is rigorous and the gains are robust, this would constitute a useful contribution to delta-rule linear recurrent models by identifying key-norm normalization as a first-order design choice that improves accuracy, extrapolation, and efficiency without altering state dimensionality or hardware kernels. The preservation of the existing chunkwise parallel algorithm and the minimal change (one scalar) are clear practical strengths that facilitate adoption. The empirical results at the 0.4B scale provide concrete evidence of benefit on both language modeling and controlled retrieval tasks.
major comments (2)
- [§3] §3 (Derivation of β_t): the manuscript states that β_t = η_t / (‖k_t‖₂² + ε) is derived from the GDN online-regression objective via the Kaczmarz projection, yet provides no explicit expansion showing how the projection interacts with the learned forget gates g_t and write gates. Without this step-by-step accounting, it remains unclear whether the normalization emerges exactly after gating or requires additional assumptions on key distributions; this directly affects whether the 0.41 perplexity drop is theoretically motivated or an empirical tuning result.
- [§4] §4 (Experimental results): the central performance claims (validation perplexity 8.09 vs. 8.50, stability to 65K tokens, task improvements) are reported as single-point numbers with no error bars, no multiple random seeds, and no statistical significance tests. There is also no ablation on the free parameter ε or sensitivity analysis for η_t, both of which appear in the step-size formula; these omissions are load-bearing for the claim that KLA is reliably superior to GDN.
minor comments (2)
- [Abstract] Abstract: the phrase 'one-scalar modification' is used without immediately clarifying that the scalar in question is the step-size coefficient itself.
- [§3] Notation: the symbols η_t and ε are introduced in the step-size formula but their initialization or scheduling is not summarized in the main text before the experimental section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We provide point-by-point responses below and will incorporate the suggested clarifications and additional analyses in the revised manuscript.
point-by-point responses
Referee: [§3] §3 (Derivation of β_t): the manuscript states that β_t = η_t / (‖k_t‖₂² + ε) is derived from the GDN online-regression objective via the Kaczmarz projection, yet provides no explicit expansion showing how the projection interacts with the learned forget gates g_t and write gates. Without this step-by-step accounting, it remains unclear whether the normalization emerges exactly after gating or requires additional assumptions on key distributions; this directly affects whether the 0.41 perplexity drop is theoretically motivated or an empirical tuning result.
Authors: We agree that the derivation in §3 would benefit from a more explicit step-by-step expansion. The Kaczmarz projection is applied to the residual update term after the forget gate g_t has been applied to the previous state and before the write gate modulates the update. Starting from the online least-squares objective, the projection onto the key direction yields the normalization by ‖k_t‖₂² directly on the gated residual. We will expand this derivation in the revised §3 to show the interaction with g_t and the write gates without additional distributional assumptions. (Revision: yes)
Referee: [§4] §4 (Experimental results): the central performance claims (validation perplexity 8.09 vs. 8.50, stability to 65K tokens, task improvements) are reported as single-point numbers with no error bars, no multiple random seeds, and no statistical significance tests. There is also no ablation on the free parameter ε or sensitivity analysis for η_t, both of which appear in the step-size formula; these omissions are load-bearing for the claim that KLA is reliably superior to GDN.
Authors: We acknowledge the limitation of reporting single-run results without error bars or multiple seeds. In the revision, we will rerun the experiments with at least three random seeds and report means with standard deviations for perplexity and task metrics. Additionally, we will include an ablation study on ε (e.g., values from 1e-6 to 1e-2) and sensitivity analysis for η_t to confirm the robustness of the gains. (Revision: yes)
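A hedged sketch of the promised analysis, purely illustrative: train_and_eval below is a hypothetical stand-in for the authors' training pipeline (no such function is named in the paper) and is assumed to return validation perplexity for a given ε and seed:

import statistics

def epsilon_sweep(train_and_eval, eps_grid=(1e-6, 1e-4, 1e-2), seeds=(0, 1, 2)):
    # Mean and standard deviation of validation perplexity per epsilon
    # across seeds, matching the reporting promised in the rebuttal.
    results = {}
    for eps in eps_grid:
        ppls = [train_and_eval(eps=eps, seed=s) for s in seeds]
        results[eps] = (statistics.mean(ppls), statistics.stdev(ppls))
    return results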
Circularity Check
No circularity detected in derivation chain
full rationale
The paper revisits the online-regression objective of GDN and states that it derives the normalized step-size form β_t = η_t / (‖k_t‖₂² + ε) via Kaczmarz projection. No quoted equation or section reduces this form to a tautological re-expression of the input objective, a fitted parameter renamed as prediction, or a self-citation chain. The parameters η_t and ε are presented as part of the resulting update rule rather than hidden inputs that force the output. The central empirical claims (perplexity, retrieval accuracy) rest on experimental comparison rather than on the derivation alone being self-referential. The derivation chain is therefore self-contained against external benchmarks and does not match any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
free parameters (2)
- η_t
- ε
axioms (1)
- domain assumption: the delta-rule update in GDN is exactly the online solution to a regression objective.
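The domain assumption can be made concrete; a hedged sketch of the usual online-regression reading of the delta rule, gates omitted, consistent with the abstract's framing:

\[
\mathcal{L}_t(S) = \tfrac{1}{2}\lVert S^{\top}k_t - v_t\rVert_2^2,
\qquad
\nabla_S \mathcal{L}_t(S) = k_t\,(S^{\top}k_t - v_t)^{\top},
\]

so the delta-rule write S_t = S_{t−1} − β_t ∇_S L_t(S_{t−1}) is one gradient step on this per-token objective, and β_t = 1/‖k_t‖₂² makes that step the exact Kaczmarz projection onto {S : Sᵀk_t = v_t}.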
Lean theorems connected to this paper
- Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes?
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size β_t = η_t / (‖k_t‖₂² + ε) for residual updates. ... The exact line-search coefficient is therefore τ* = 1/‖k_t‖₂².
- Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic · echoes?
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Proposition 1 (Exact proximal form). ... the relaxed update in (3) is the exact minimizer of min_S ½‖S − S̃_{t−1}‖²_F + (μ_t/2)‖Sᵀk_t − v_t‖₂².
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.