pith. machine review for the scientific record.

arxiv: 2605.13473 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.CL

Recognition: no theorem link

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:04 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords linear attention · delta rule · online preconditioning · in-context recall · diagonal preconditioner · hypergradient · chunkwise parallelism · state-space models

The pith

OSDN augments the Delta Rule with an online diagonal preconditioner equivalent to per-feature key scaling, delivering super-geometric convergence and 39% lower recall residual at 1.3B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Online Scaled DeltaNet (OSDN) to fix a limitation in the Delta Rule for linear attention: its single scalar step size ignores how the inner loss curves differently across features. OSDN adds a diagonal preconditioner updated online through hypergradient feedback. This preconditioner turns out to be exactly the same as scaling the write-side key vector once per feature, so the method runs in the same chunkwise parallel pipeline without extra high-dimensional state. The authors exploit the exact quadratic form of the regression loss to prove super-geometric convergence to a right-Newton comparator together with a token-local residual contraction bound. At 1.3 billion parameters the approach reduces the recall residual ratio by 39 percent while matching the baseline on perplexity and LongBench.
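The claimed equivalence between right-preconditioning and write-side key scaling can be checked in a few lines. The NumPy sketch below uses arbitrary dimensions and step size (none of these values come from the paper): one Delta Rule gradient step with a right diagonal preconditioner D lands on exactly the same state as the plain step with the key scaled per feature on the write side.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, beta = 8, 4, 0.5               # illustrative sizes and step size

S = rng.normal(size=(d_v, d_k))          # associative-memory state
k = rng.normal(size=d_k)                 # key
v = rng.normal(size=d_v)                 # value
D = np.diag(rng.uniform(0.5, 1.5, d_k))  # diagonal preconditioner

# Delta Rule = one gradient step on the inner loss f(S) = 0.5 * ||S k - v||^2
grad = np.outer(S @ k - v, k)

# Right-preconditioned step: S <- S - beta * grad @ D
S_precond = S - beta * grad @ D

# Same step written as per-feature scaling of the write-side key (D @ k);
# note the read side (S @ k - v) still uses the unscaled key
S_scaled_key = S - beta * np.outer(S @ k - v, D @ k)

assert np.allclose(S_precond, S_scaled_key)
```

Because the two forms coincide, an implementation can keep the existing chunkwise kernel and simply multiply the write-side key by the current diagonal before the scan.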

Core claim

OSDN augments the scalar gate in the Delta Rule with a diagonal preconditioner updated online via hypergradient feedback. This right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key, allowing the method to preserve the hardware-friendly chunkwise parallel pipeline. By exploiting the exact-quadratic structure of the inner regression loss, OSDN establishes super-geometric convergence against a right-Newton comparator and proves an algorithm-aligned token-local residual contraction bound. Adaptive Preconditioner Forgetting is introduced to handle non-stationary contexts by dynamically refreshing stale calibration.

What carries the argument

The diagonal preconditioner maintained by online hypergradient updates, algebraically equivalent to per-feature scaling of the write-side key in the Delta Rule.
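For concreteness, here is one way a Baydin-style hypergradient update for a diagonal preconditioner can be maintained with only O(d) extra state. This is a sketch under assumptions, not the paper's algorithm: the learning rates, the one-token-lag hypergradient, and the clipping bounds are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, d_v = 8, 4
beta, eta = 0.5, 0.1                 # inner step size, hypergradient rate (assumed)

S = np.zeros((d_v, d_k))             # memory state
p = np.ones(d_k)                     # diagonal preconditioner: O(d) state

keys = rng.normal(size=(16, d_k))
vals = rng.normal(size=(16, d_v))

prev_k = prev_r = None
for k, v in zip(keys, vals):
    r = S @ k - v                        # inner residual before the write
    S = S - beta * np.outer(r, p * k)    # preconditioned Delta Rule write
    if prev_k is not None:
        # Hypergradient of the current loss w.r.t. the previous step's
        # preconditioner entries; for the quadratic inner loss it reduces to
        # the element-wise form  -beta * k_{t-1,j} * k_{t,j} * <r_{t-1}, r_t>,
        # so no matrix-valued extra state is needed.
        hg = -beta * prev_k * k * (prev_r @ r)
        p = np.clip(p - eta * hg, 0.1, 10.0)  # keep scales positive and bounded
    prev_k, prev_r = k, r
```

The point of the sketch is the state accounting: the only quantities carried across tokens beyond the usual memory S are the d-dimensional vector p and the previous key/residual pair.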

If this is right

  • At 340M parameters OSDN improves JRT-style in-context recall by 32 percent over the plain Delta Rule.
  • At 1.3B parameters the recall residual ratio drops by 39 percent while perplexity and LongBench scores stay comparable.
  • Adaptive Preconditioner Forgetting refreshes the preconditioner on the fly to handle non-stationary contexts.
  • The theoretical guarantee is super-geometric convergence to the right-Newton comparator with token-local residual contraction.
  • The algebraic equivalence to key scaling lets the algorithm keep its original chunkwise parallel implementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same online diagonal scaling could be added to other linear-attention or state-space models that rely on quadratic inner losses.
  • Because the change is realized as a simple per-feature multiplication on the key, it can be dropped into existing chunkwise kernels with almost no extra code.
  • If the quadratic-loss assumption continues to hold at still larger scales, the method could become a default upgrade for any Delta-Rule-style memory update.
  • The Adaptive Preconditioner Forgetting schedule itself may be useful in any online setting where curvature estimates quickly become stale.

Load-bearing premise

The inner regression loss must remain exactly quadratic so that the hypergradient update for the diagonal preconditioner stays valid and the convergence bounds apply.
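Under this premise the token-local contraction has a closed form: one preconditioned write shrinks the residual along the written key by the scalar factor 1 − β·kᵀDk, so the bound is checkable directly. A small NumPy illustration (shapes and values are arbitrary, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k, d_v, beta = 8, 4, 1.0
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                    # unit key for a clean factor
v = rng.normal(size=d_v)
D = np.diag(rng.uniform(0.5, 1.5, d_k))   # diagonal preconditioner

r0 = S @ k - v                            # residual before the write
S1 = S - beta * np.outer(r0, D @ k)       # one preconditioned Delta write
r1 = S1 @ k - v                           # residual after the write

# Exact-quadratic loss => r1 = (1 - beta * k^T D k) * r0, a strict contraction
# whenever 0 < beta * k^T D k < 2.
factor = 1 - beta * (k @ D @ k)
assert np.allclose(r1, factor * r0)
assert abs(factor) < 1
```

If the inner loss were not exactly quadratic, r1 would pick up higher-order terms and this one-line factor, and the bounds built on it, would no longer hold, which is why the premise is load-bearing.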

What would settle it

Measurements at 1.3B scale showing recall-residual improvement below 20 percent or convergence that is only linear instead of super-geometric would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.13473 by Chenyu Zhou, Dongdong Ge, Hongpei Li, Jianghao Lin, Yinyu Ye, Yuerou Liu.

Figure 1: Motivation and Computation Flow of Online Scaled DeltaNet (OSDN). Left: From an online learning perspective, standard DeltaNet applies a uniform scalar learning rate, struggling to adapt to optimization directions with varying curvatures (e.g., frequent vs. rare keys). OSDN resolves this by introducing a diagonal preconditioner that dynamically scales the update directions. Right: The OSDN computation flow…
Figure 2: Direct theorem-facing residual contraction in the DeltaNet rows. (a) Geometric mean of q_t = f_t(S_t)/f_t(S_{t-1}) by relative-position bin on JRT-twice prompts; the dashed boundary marks the single-pass / repeated-context transition. (b) Overall q_geo across 7.85×10⁶ token-layer-head measurements. … to 0.315; repeated recall is within noise; and the commonsense and LongBench averages remain at parity with DeltaNet.…
Figure 3: Visual summary across matched 340M rows and the 1.3B DeltaNet scale-up.
Figure 4: Auxiliary PG-19 length-extrapolation diagnostic. Each GPU consumes a 65,536-token packed batch, but training segments are variable-length FineWeb-Edu documents capped at 4K tokens with cu_seqlens-aligned recurrent-state resets at every segment boundary, so the effective recurrent training context is at most 4K tokens. Models are then evaluated on 20K-token PG-19 blocks, well beyond any contiguous segment s…
Figure 5: Token-vs-CE training trajectories at matched 340M scale. Each curve consumes the same 10.74B-token budget at 524,288 tokens per optimizer step (Appendix F); curves are rendered as a centred 257-step rolling mean. Within each panel, the baseline uses a lighter dashed tone, online-scaled (OS) variants a mid-saturation tone, and APF variants the deepest tone. Panel (a) plots the full trajectory on a logarithm…
Original abstract

Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Online Scaled DeltaNet (OSDN), augmenting the Delta Rule for linear attention with a diagonal preconditioner updated online via hypergradient feedback. This preconditioner is algebraically equivalent to per-feature scaling of the write-side key, preserving the chunkwise parallel pipeline without high-dimensional state. Theoretical results exploit the exact-quadratic inner regression loss to prove super-geometric convergence against a right-Newton comparator and an algorithm-aligned token-local residual contraction bound. Adaptive Preconditioner Forgetting (APF) handles non-stationary contexts. Empirically, OSDN yields a 32% improvement in JRT-style in-context recall at 340M parameters and a 39% reduction in recall residual ratio at 1.3B parameters while maintaining parity on perplexity and LongBench.

Significance. If the convergence proofs hold rigorously and the hypergradient update preserves O(d) state and chunkwise parallelism, the work would meaningfully advance linear attention and state-space models by addressing associative recall limitations in a hardware-efficient manner. The scaling results to 1.3B parameters indicate practical transferability, offering a potential path to improve in-context learning without softmax attention overhead.

major comments (3)
  1. [Theoretical Analysis] Theoretical Analysis section: The super-geometric convergence and token-local residual contraction bounds exploit the exact-quadratic structure, but the online hypergradient update for the diagonal preconditioner introduces a second-order term when differentiating the Delta-rule update w.r.t. preconditioner parameters. It is not shown that this term remains O(d) state and compatible with the chunkwise parallel scan, even though the forward-pass equivalence to key scaling is established.
  2. [§5 Experiments] §5 Experiments: The 39% reduction in recall residual ratio at 1.3B parameters and 32% improvement at 340M parameters are reported without error bars, details on data exclusion criteria, or statistical significance tests, which undermines assessment of the scaling claims and cross-scale consistency.
  3. [Method] Method section: The Adaptive Preconditioner Forgetting rate is a free parameter whose effect on the provable bounds is not analyzed; the theoretical results assume conditions that APF relaxes, but no adjusted convergence statement is provided.
minor comments (2)
  1. [Abstract] Abstract: The term 'JRT-style in-context recall' is used without definition or citation, reducing accessibility.
  2. [Notation] Notation: Ensure the preconditioner matrix and hypergradient symbols are defined consistently before first use in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical Analysis section: The super-geometric convergence and token-local residual contraction bounds exploit the exact-quadratic structure, but the online hypergradient update for the diagonal preconditioner introduces a second-order term when differentiating the Delta-rule update w.r.t. preconditioner parameters. It is not shown that this term remains O(d) state and compatible with the chunkwise parallel scan, even though the forward-pass equivalence to key scaling is established.

    Authors: We appreciate this observation. The forward equivalence to per-feature key scaling ensures that the main Delta-rule update preserves chunkwise parallelism and O(d) state. For the hypergradient update of the diagonal preconditioner, the second-order term arises from differentiating through the update rule. Because the preconditioner is strictly diagonal, this differentiation reduces to element-wise operations that can be maintained with O(d) additional state (the current preconditioner and its gradient estimate). We will include a supplementary derivation in the revised Theoretical Analysis section demonstrating that the hypergradient feedback integrates into the existing parallel scan recurrence without increasing state complexity or breaking chunkwise compatibility. revision: yes

  2. Referee: [§5 Experiments] §5 Experiments: The 39% reduction in recall residual ratio at 1.3B parameters and 32% improvement at 340M parameters are reported without error bars, details on data exclusion criteria, or statistical significance tests, which undermines assessment of the scaling claims and cross-scale consistency.

    Authors: We agree that the empirical results would benefit from additional statistical rigor. In the revised version of §5, we will augment the reported figures with error bars computed over multiple random seeds (at least 3 per scale), specify the data exclusion criteria applied to the JRT-style in-context recall benchmarks, and include statistical significance tests (e.g., paired t-tests) comparing OSDN against the DeltaNet baseline. These additions will better support the scaling claims and cross-scale consistency. revision: yes

  3. Referee: [Method] Method section: The Adaptive Preconditioner Forgetting rate is a free parameter whose effect on the provable bounds is not analyzed; the theoretical results assume conditions that APF relaxes, but no adjusted convergence statement is provided.

    Authors: APF is a practical heuristic designed to mitigate the effects of non-stationarity in real-world contexts, which the core theoretical analysis assumes away. The super-geometric convergence and residual contraction bounds hold under the stationary quadratic-loss setting without forgetting. Introducing APF relaxes these assumptions to improve empirical robustness, but deriving a modified convergence rate would require additional modeling of the forgetting dynamics. We will add a clarifying paragraph in the Method section explaining this distinction and noting that APF preserves the O(d) state and parallelism while serving as an empirical enhancement. A full theoretical treatment of APF is left for future work. revision: partial
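The statistical check promised in response 2 is mechanically simple; the sketch below uses invented placeholder numbers (three seeds per model) purely to show the shape of a paired t-test over per-seed recall-residual ratios, not the paper's measurements.

```python
import math
import statistics

# Hypothetical recall-residual ratios over three seeds (placeholder values)
deltanet = [0.52, 0.55, 0.50]
osdn     = [0.31, 0.33, 0.30]

# Paired t-test on per-seed differences, df = n - 1
diffs = [a - b for a, b in zip(deltanet, osdn)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
t = mean_d / (sd_d / math.sqrt(n))   # compare against t-distribution with n-1 df
```

With real seeds one would report t alongside the per-seed spread; three seeds gives only two degrees of freedom, so the promised "at least 3 per scale" is the bare minimum for such a test.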

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that the inner regression loss is exactly quadratic and that the online preconditioner update preserves parallelism without additional state; no new entities are postulated.

free parameters (1)
  • Adaptive Preconditioner Forgetting rate
    Dynamically refreshed calibration parameter for non-stationary contexts, introduced to handle stale preconditioner values.
axioms (1)
  • domain assumption Exact-quadratic structure of the inner regression loss
    Invoked to derive super-geometric convergence against the right-Newton comparator.
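The excerpt does not spell out the APF refresh rule, so the following is only one plausible reading: exponentially decay the calibration back toward identity scaling so that stale curvature estimates fade out, with the forgetting rate γ as the free parameter the ledger flags.

```python
import numpy as np

gamma = 0.95                       # forgetting rate: the free parameter (assumed value)
p = np.array([1.8, 0.4, 1.2])      # current per-feature calibration

# One APF-style refresh (sketch): shrink every entry toward 1.0, i.e. toward
# the un-preconditioned Delta Rule, so old curvature estimates lose influence.
p = gamma * p + (1.0 - gamma) * np.ones_like(p)
```

Any such rule trades adaptivity for stability: γ near 1 barely forgets, γ near 0 discards calibration every step, which is exactly why the referee asks how the bounds depend on it.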



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  1. [1]

    What learning algorithm is in-context learning? investigations with linear models

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. InInternational Conference on Learning Representations, 2023

  2. [2]

    Zoology: Measuring and improving recall in efficient language models

    Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models, 2024. URLhttps://arxiv.org/abs/2312.04927

  3. [3]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff, 2024. URLhttps://arxiv.org/abs/2402.18668

  4. [4]

    Just read twice: Closing the recall gap for recurrent language models, 2024

    Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré. Just read twice: Closing the recall gap for recurrent language models, 2024. URLhttps://arxiv.org/abs/2407.05483

  5. [5]

    Hinton, V olodymyr Mnih, Joel Z

    Jimmy Ba, Geoffrey E. Hinton, V olodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using fast weights to attend to the recent past. InAdvances in Neural Information Processing Systems, 2016

  6. [6]

    LongBench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InAnnual Meeting of the Association for Computational Linguistics, 2024

  7. [7]

    Online learning rate adaptation with hypergradient descent

    Atılım Güne¸ s Baydin, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. InInternational Conference on Learning Representations, 2018

  8. [8]

    xLSTM: extended long short-term memory.Advances in Neural Infor- mation Processing Systems, 37:107547–107603, December 2024

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prud- nikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochre- iter. xLSTM: extended long short-term memory.Advances in Neural Infor- mation Processing Systems, 37:107547–107603, December 2024. doi: 10.52202/ 079017-3417. URL https://proceedings.neurips.cc/...

  9. [9]

    Titans: Learning to memorize at test time,

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time,

  10. [10]

    URLhttps://arxiv.org/abs/2501.00663

  11. [11]

    Atlas: Learning to optimally memorize the context at test time, 2025

    Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time, 2025. URLhttps://arxiv.org/abs/2505.23735

  12. [12]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. InAAAI Conference on Artificial Intelligence, 2020

  13. [13]

    Rethinking attention with performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=Ua6zuk0WRH

  14. [14]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In North American Chapter of the Association for Computational Linguistics, 2019. 10

  15. [15]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

  16. [16]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. InProceedings of the 41st International Conference on Machine Learning, pages 10041–10071. PMLR, July 2024. URL https://proceedings. mlr.press/v235/dao24a.html

  17. [17]

    DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InNorth American Chapter of the Association for Computational Linguistics, 2019

  18. [18]

    Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12:2121–2159, 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12:2121–2159, 2011

  19. [19]

    Gradient methods with online scaling

    Wenzhi Gao, Ya Chi Chu, Yinyu Ye, and Madeleine Udell. Gradient methods with online scaling. arXiv preprint arXiv:2411.01803, 2024. URLhttps://arxiv.org/abs/2411.01803

  20. [20]

    Mamba: linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: linear-time sequence modeling with selective state spaces. August 2024. URLhttps://openreview.net/forum?id=tEYskw1VY2

  21. [21]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. October 2021. URL https://openreview.net/forum?id=uYLFoz1vlAC. shortConferenceName: ICLR

  22. [22]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, 2018

  23. [23]

    Foundations and Trends in Optimiza- tion, 2016

    Elad Hazan.Introduction to Online Convex Optimization. Foundations and Trends in Optimiza- tion, 2016

  24. [24]

    Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2–3):169–192, 2007

    Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.Machine Learning, 69(2–3):169–192, 2007

  25. [25]

    Jelassi, D

    Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying, 2024. URL https://arxiv.org/ abs/2402.01032

  26. [26]

    Weld, and Luke Zettlemoyer

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InAnnual Meeting of the Association for Computational Linguistics, 2017

  27. [27]

    Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A. Smith. Finetuning pretrained transformers into RNNs. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10630...

  28. [28]

    Transformers are RNNs: fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: fast autoregressive transformers with linear attention. InProceedings of the 37th International Conference on Machine Learning, pages 5156–5165. PMLR, November 2020. URL https://proceedings.mlr.press/v119/katharopoulos20a.html. shortConfer- enceName: ICML

  29. [29]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInterna- tional Conference on Learning Representations, 2015

  30. [30]

    Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...

  31. [31]

    arXiv preprint arXiv:2603.15569 , year=

    Aakash Lahoti, Kevin Y . Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles, 2026. URL https://arxiv.org/abs/2603.15569

  32. [32]

    Longhorn: State space models are amortized online learners

    Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space models are amortized online learners, 2024. URLhttps://arxiv.org/abs/2407.14207

  33. [33]

    InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10236–10242, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1...

  34. [34]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. InInternational Conference on Learning Representations, 2017

  35. [35]

    In-context learning and induction heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. InTransformer Circuits Thread, 2022. URL https:// transformer-circuits.pub/2022/in-context-learning-and-induction-heads/ index.html

  36. [36]

    Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De

    Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, 2023

  37. [37]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. InAnnual Meeting of the Association for Computational Linguistics, 2016

  38. [38]

    Eagle and finch: RWKV with matrix-valued states and dynamic recurrence

    Bo Peng, Daniel Goldstein, Quentin Gregory Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Kranthi Kiran Gv, Haowen Hou, Satyapriya Krishna, Ronald McClelland Jr, Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Jian Zhu, and Rui-Jie Zhu. Eagle and finc...

  39. [39]

    RWKV-7 “Goose” with expressive dynamic state evolution, 2025

    Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Du, Xingjian Hou, et al. RWKV-7 “Goose” with expressive dynamic state evolution, 2025. URLhttps://arxiv.org/ abs/2503.14456

  40. [40]

    Random feature attention

    Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. October 2020. URL https://openreview.net/forum?id= QtTKTdVrFBB. shortConferenceName: ICLR

  41. [41]

    Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, and Noah A. Smith. ABC: attention with bounded-memory control. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7469–7483...

  42. [42]

    Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y . Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. InInternational Conference on Machine Learning, 2023

  43. [43]

    HGRN2: gated linear RNNs with state expansion

    Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. HGRN2: gated linear RNNs with state expansion. August 2024. URL https: //openreview.net/forum?id=y6SqbJfCSk

  44. [44]

    W., Potapenko, A., Jayakumar, S

    Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling, 2020. URL https://arxiv.org/abs/ 1911.05507. 12

  45. [45]

    Hopfield networks is all you need

    Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlovi´c, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need. InInternational Conference on Learning Representa...

  46. [46]

    WinoGrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. InAAAI Conference on Artificial Intelligence, 2021

  47. [47]

    SocialIQA: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. SocialIQA: Commonsense reasoning about social interactions. InConference on Empirical Methods in Natural Language Processing, 2019

  48. [48]

    Linear Transformers Are Secretly Fast Weight Programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. InProceedings of the 38th International Conference on Machine Learning, pages 9355–9366. PMLR, July 2021. URL https://proceedings.mlr.press/ v139/schlag21a.html

  49. [49]

    Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992

    Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 4(1):131–139, 1992

  50. [50]

    Deltaproduct: Increasing the expressivity of DeltaNet through products of householders,

    Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Increasing the expressivity of DeltaNet through products of householders,

  51. [51]

    URLhttps://arxiv.org/abs/2502.10297

  52. [52]

    Jimmy T. H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. September 2022. URL https://openreview.net/forum?id= Ai8Hw3AXqks

  53. [53]

    Smith, Andrew Warrington, and Scott Linderman

    Jimmy T.H. Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. InInternational Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Ai8Hw3AXqks

  54. [54]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states, 2024. URL https: //arxiv.org/abs/2407.04620

  55. [55]

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. ArXiv, abs/2307.08621, 2023. URL https://api.semanticscholar.org/CorpusID:259937453

  56. [56]

    Richard S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI), pages 171–176, 1992

  57. [57]

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiez...

  58. [58]

    Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, 2023

  59. [59]

    Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, and João Sacramento. Uncovering mesa-optimization algorithms in transformers, 2024. URL https://arxiv.org/abs/2309.05858

  60. [60]

    Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, and João Sacramento. MesaNet: Sequence modeling by locally optimal test-time training, 2025...

  61. [61]

    Ke Alexander Wang, Jiaxin Shi, and Emily B. Fox. Test-time regression: A unifying framework for designing sequence models with associative memory, 2025. URL https://arxiv.org/abs/2501.12352

  62. [62]

    Kaiyue Wen, Xingyu Dang, and Kaifeng Lyu. RNNs are not transformers (yet): The key bottleneck on in-context retrieval, 2024. URL https://arxiv.org/abs/2402.18510

  63. [63]

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with Delta Rule. October 2024. URL https://openreview.net/forum?id=r8H7xhYPwz

  64. [64]

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated Linear Attention Transformers with Hardware-Efficient Training. In Proceedings of the 41st International Conference on Machine Learning, pages 56501–56523. PMLR, July 2024. URL https://proceedings.mlr.press/v235/yang24ab.html

  65. [65]

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 115491–115522. Curran Associates, Inc., 2024. doi: 10...

  66. [66]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019

  67. [67]

    Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, and Guohong Fu. Gated slot attention for efficient linear-time sequence modeling. Advances in Neural Information Processing Systems, 37:116870–116898, December 2024. doi: 10.52202/079017-3710. URL https://proceedings.neurips.cc/...

  68. [68]

    Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, 2003

Appendix Roadmap

The appendix is organized to separate mechanism, implementation, theory, evaluation protocol, additional evidence, and background. Appendix A collects derivations and proof details for the Del...

Under $\|k_t\|_2^2 = 1$, $\lambda_{\max} = \beta_t^2 \sum_i (k_t)_i^4$. Since $\beta_t \in (0,1)$, the power-mean inequality gives $\sum_i (k_t)_i^4 \le \big(\sum_i (k_t)_i^2\big)^2 = 1$. The reported runs use the practical online step size $\eta = 0.003$ inside $\mathcal{D} = [0.5, 2.0]^K$, with reproduction details in Appendix F.

Corollary A.2 (Monotone descent under bounded box). Assume $\|k_t\|_2^2 = 1$, $\beta_t \in (0,1)$, and $d_t \in \mathcal{D} = [d_{\min}, d_{\max}]$...
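The power-mean bound above can be sanity-checked numerically. This is an illustrative sketch only (the key dimension, seed, and $\beta_t$ value are arbitrary choices, not the paper's settings): for any unit-norm key, $\sum_i k_i^4 \le (\sum_i k_i^2)^2 = 1$, so $\lambda_{\max} = \beta_t^2 \sum_i k_i^4 < 1$.

```python
import numpy as np

# Illustrative check of the bound: for a unit-norm key k_t and beta_t in (0,1),
# lambda_max = beta_t^2 * sum_i k_i^4 <= beta_t^2 * (sum_i k_i^2)^2 = beta_t^2 < 1.
rng = np.random.default_rng(0)
K = 64                              # arbitrary key dimension for the demo
k = rng.standard_normal(K)
k /= np.linalg.norm(k)              # enforce ||k_t||_2 = 1
beta = 0.7                          # any value in (0, 1)

sum_k4 = float(np.sum(k**4))
lam_max = beta**2 * sum_k4

assert sum_k4 <= float(np.sum(k**2))**2 + 1e-12   # power-mean inequality; RHS = 1
assert lam_max < 1.0
```

The assertion holds for every unit-norm key, not just the sampled one, since it is an algebraic identity in disguise: with $x_i = k_i^2 \ge 0$, $\sum_i x_i^2 \le (\sum_i x_i)^2$.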

Under $\|k_t\|_2 = 1$, $n_t = 1$ and $\|s_t\|_2^2 = \sum_i k_{t,i}^4$; the Cauchy–Schwarz / power-mean inequality gives the displayed bound.

Conditional algorithmic regret. Let $\mathcal{D} \subset \mathbb{R}^K$ be a closed convex set containing the algorithmic iterates $\{d_t\}$. We assume the online learner producing $\{d_t\}$ admits a sublinear regret bound against any fixed comparator in $\mathcal{D}$:
$$\sum_{t=1}^{T} \big( h_t(d_t) - h_t(d) \big) \ldots$$
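A minimal sketch of the kind of learner the regret assumption covers, not the paper's implementation: projected online gradient descent on per-token surrogates over the box $\mathcal{D} = [0.5, 2.0]^K$. The concrete forms of $s_t$ (squared key entries, consistent with $\|s_t\|_2^2 = \sum_i k_{t,i}^4$) and of the surrogate $h_t$ below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: projected online gradient descent over the box D = [0.5, 2.0]^K,
# the standard setting for a sublinear-regret assumption. s_t and h_t are assumed forms.
rng = np.random.default_rng(1)
K, T, eta = 8, 500, 0.003          # eta matches the reported practical step size
lo, hi = 0.5, 2.0
d = np.ones(K)                     # iterate d_t, kept inside D by projection

for _ in range(T):
    k = rng.standard_normal(K)
    k /= np.linalg.norm(k)         # unit-norm key, so n_t = 1
    beta = float(rng.uniform(0.1, 0.9))
    s = k**2                       # assumed feature vector with ||s||^2 = sum_i k_i^4
    # hypergradient of the assumed surrogate h(d) = ((1 - beta <d, s>)^2 - 1) / 2
    grad = -beta * (1.0 - beta * np.dot(d, s)) * s
    d = np.clip(d - eta * grad, lo, hi)   # projection onto the box D

assert np.all((d >= lo) & (d <= hi))
```

Projection onto a box is just a coordinate-wise clip, which is why a box-constrained $\mathcal{D}$ keeps the per-token update cheap.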

Defining $q_t := f_t(S_t)/f_t(S_{t-1})$ on the non-degenerate set $\{u_t \neq 0\}$,
$$q_t = \big(1 - \beta_t \langle d_t, s_t \rangle\big)^2 \ge 0, \tag{20}$$
and Equation (18) with $n_t = 1$ gives $q_t = 1 + 2 h_t(d_t)$. (When $u_t = 0$ the ratio is $0/0$; we interpret $q_t = 1$ since $f_t(S_t) = f_t(S_{t-1}) = 0$, consistent with $h_t = 0$.) Since $q_t \ge 0$, the AM–GM inequality gives
$$\prod_{t=1}^{T} q_t \le \left( \frac{1}{T} \sum_{t=1}^{T} q_t \right)^{T} = \left( 1 + \frac{2}{T} \sum_{t=1}^{T} h_t(d_t) \right)^{T}.$$
By Equ...
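Both identities and the AM–GM step can be verified numerically on synthetic data (shapes, seeds, and value ranges below are arbitrary illustrations, not the paper's configuration):

```python
import numpy as np

# Synthetic check of q_t = (1 - beta_t <d_t, s_t>)^2 >= 0, q_t = 1 + 2 h_t(d_t),
# and the AM-GM product bound prod_t q_t <= ((1/T) sum_t q_t)^T.
rng = np.random.default_rng(2)
T, K = 40, 16
beta = rng.uniform(0.1, 0.9, size=T)
d = rng.uniform(0.5, 2.0, size=(T, K))   # iterates inside the box D = [0.5, 2.0]^K
s = 0.05 * rng.standard_normal((T, K))

inner = np.einsum("tk,tk->t", d, s)       # per-token <d_t, s_t>
q = (1.0 - beta * inner) ** 2
h = (q - 1.0) / 2.0                       # h_t(d_t) recovered from q_t = 1 + 2 h_t

assert np.all(q >= 0.0)
assert np.allclose(q, 1.0 + 2.0 * h)
# AM-GM: the geometric mean of nonnegative q_t is at most their arithmetic mean
assert np.prod(q) <= np.mean(q) ** T + 1e-12
```

The last assertion is exactly the displayed product bound with $T = 40$; descent corresponds to $q_t < 1$, i.e. $h_t(d_t) < 0$.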

Sublinear-regret online learning gives $R_T = O(\sqrt{T})$ (or $O(\log T)$ under additional curvature), so $R_T/T \to 0$.

"single" averages FDA, SWDE, and SQuAD, while "repeated ...

In Theorem D.5, the Rayleigh-quotient inequality (Step 2) provides a constant lower bound $|h_t(D^\star)| \ge 1/(2L)$, so the per-step ratio $h_t(D_t)/h_t(D^\star)$ approaches 1 as the meta-learner converges, driving each $r_t$ to zero. In Theorem D.7, the analogous role is played by the AM–GM step, which converts cumulative regret on the per-token surrogate into a contraction ...