Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Aditya Chattopadhyay; Elvis Nunez; Liangzu Peng; Luca Zancato; Stefano Soatto; Wei Xia

arxiv: 2511.21016 · v3 · pith:3H7DRKZ6new · submitted 2025-11-26 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 17:54 UTC · model grok-4.3

keywords: state space models · Kalman filter · ridge regression · long context modeling · linear attention · sequence models · gated networks · fading memory

The pith

Gated KalmaNet uses full Kalman filter covariance for better long-context recall in state-space models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Gated KalmaNet, a new layer for linear state-space models that is derived from the Kalman filter. It improves upon existing SSM layers by maintaining the full error covariance matrix rather than assuming it is the identity, which allows computing the exact Kalman gain for state updates. This enables the model to better account for how past information should influence current states. Under a steady-state assumption, the approach reduces to an online ridge regression that keeps memory constant and compute linear in sequence length. The authors demonstrate superior performance on long-context retrieval and question answering tasks as well as image classification.
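
To make the contrast between a full-covariance update and an identity-covariance update concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the scalar-valued toy data, and the unit measurement noise are all assumptions chosen for illustration. It uses the textbook recursive least squares form, in which tracking the matrix `P` and computing the gain from it reproduces the ridge regression solution exactly, while fixing `P` to the identity at every step does not.

```python
# Illustrative sketch only: names and toy data are hypothetical, not from the paper's code.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def covariance_aware_step(w, P, k, v):
    """One recursive-least-squares step: the gain is computed from the tracked matrix P."""
    Pk = matvec(P, k)                       # P k (P is symmetric)
    denom = 1.0 + dot(k, Pk)                # innovation variance, unit measurement noise assumed
    gain = [x / denom for x in Pk]          # Kalman-style gain
    err = v - dot(w, k)                     # prediction error for this key/value pair
    w_new = [wi + gi * err for wi, gi in zip(w, gain)]
    n = len(k)
    # Sherman-Morrison update: P <- P - (P k)(P k)^T / (1 + k^T P k)
    P_new = [[P[i][j] - gain[i] * Pk[j] for j in range(n)] for i in range(n)]
    return w_new, P_new

def identity_covariance_step(w, k, v):
    """Same update with P frozen to the identity, so the gain is just a rescaled key."""
    denom = 1.0 + dot(k, k)
    err = v - dot(w, k)
    return [wi + (ki / denom) * err for wi, ki in zip(w, k)]

lam = 0.1  # ridge strength (assumed value)
data = [([1.0, 0.0], 1.0), ([0.6, 0.8], -1.0), ([0.8, 0.6], 0.5), ([0.0, 1.0], 2.0)]

w_full, P = [0.0, 0.0], [[1.0 / lam, 0.0], [0.0, 1.0 / lam]]
w_id = [0.0, 0.0]
for k, v in data:
    w_full, P = covariance_aware_step(w_full, P, k, v)
    w_id = identity_covariance_step(w_id, k, v)

# Ridge normal equations: (sum k k^T + lam I) w = sum v k
H = [[lam, 0.0], [0.0, lam]]
b = [0.0, 0.0]
for k, v in data:
    for i in range(2):
        b[i] += v * k[i]
        for j in range(2):
            H[i][j] += k[i] * k[j]

def normal_equation_residual(w):
    return max(abs(hw - bi) for hw, bi in zip(matvec(H, w), b))

print("full covariance residual:    ", normal_equation_residual(w_full))
print("identity covariance residual:", normal_equation_residual(w_id))
```

The first residual is at rounding level because the recursion with `P` initialised to `I / lam` is algebraically the ridge solution; the second is not, which is the gap the paper attributes to identity-covariance layers. The fading weights and gating used in GKA are deliberately left out of this sketch.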

Core claim

Gated KalmaNet maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The method addresses numerical issues in low precision with adaptive regularization via input-dependent gating and Chebyshev Iteration for stability.
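
As a rough illustration of the solver named above, the following sketch applies the textbook Chebyshev iteration to a small regularized system `(H + lam I) x = q`. It is a plain-Python toy under stated assumptions, not the paper's Triton kernel: the function name `chebyshev_solve`, the toy keys, and the choice of eigenvalue bounds (`lam` below, `lam + trace(H)` above) are illustrative. The method needs only matrix-vector products and fixed scalar coefficients derived from the bounds, with no inner products of residuals.

```python
# Illustrative sketch only: a textbook Chebyshev iteration, not the paper's kernel.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def chebyshev_solve(A, b, lam_min, lam_max, iters):
    """Solve A x = b for symmetric positive definite A with eigenvalues in [lam_min, lam_max]."""
    d = 0.5 * (lam_max + lam_min)   # centre of the spectrum interval
    c = 0.5 * (lam_max - lam_min)   # half-width of the spectrum interval
    x = [0.0] * len(b)
    r = list(b)                     # residual b - A x for x = 0
    p, alpha = None, 0.0
    for i in range(iters):
        if i == 0:
            p = list(r)
            alpha = 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 1 else (0.5 * c * alpha) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = [ri + beta * pi for ri, pi in zip(r, p)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    return x

# Toy regularized Gram matrix built from three unit-norm keys (made-up values).
keys = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
lam = 0.1
n = 3
A = [[lam if i == j else 0.0 for j in range(n)] for i in range(n)]
for k in keys:
    for i in range(n):
        for j in range(n):
            A[i][j] += k[i] * k[j]

# For a positive semidefinite H, every eigenvalue of H + lam I lies in [lam, lam + trace(H)].
trace_H = sum(dot(k, k) for k in keys)
q = [1.0, 2.0, 3.0]
x = chebyshev_solve(A, q, lam, lam + trace_H, iters=60)
res = max(abs(qi - axi) for qi, axi in zip(q, matvec(A, x)))
print("max residual after 60 iterations:", res)
```

Because the step sizes depend only on the eigenvalue bounds, a larger `lam` narrows the interval and speeds up convergence, which is one way to read the paper's pairing of adaptive regularization with this solver.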

What carries the argument

The steady-state Kalman filter, reduced to online ridge regression, carries the argument: it provides covariance-aware updates in place of identity-covariance approximations.
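
For orientation, the generic ridge regression problem behind this reading can be written out as follows. The notation (keys $k_i$, values $v_i$, query $q_t$, strength $\lambda$, running statistics $U_t$ and $H_t$) is assumed here for illustration and is not taken from the paper; the fading weights and input-dependent regularization that GKA adds are omitted.

```latex
% Generic ridge regression over the key/value pairs seen so far (assumed notation).
\begin{align*}
W_t &= \arg\min_{W} \sum_{i=1}^{t} \lVert W k_i - v_i \rVert_2^2 + \lambda \lVert W \rVert_F^2
     = \Big(\sum_{i=1}^{t} v_i k_i^\top\Big)\Big(\sum_{i=1}^{t} k_i k_i^\top + \lambda I\Big)^{-1}, \\
U_t &= U_{t-1} + v_t k_t^\top, \qquad H_t = H_{t-1} + k_t k_t^\top, \\
o_t &= W_t q_t = U_t \,(H_t + \lambda I)^{-1} q_t .
\end{align*}
```

The two running sums have a size that does not depend on $t$, which is where the constant-memory claim comes from in this simplified form, and each read-out costs one linear solve against $H_t + \lambda I$.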

If this is right

Outperforms existing SSM layers like Mamba2 and Gated DeltaNet on short-context tasks.
Achieves more than 10% relative improvement on long-context RAG and LongQA up to 128k tokens.
Outperforms Mamba when extended to ImageNet classification.
Provides constant memory and linear compute for sequence modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be extended to other recurrent models by incorporating similar covariance tracking.
Hardware-aware implementations may allow practical scaling to contexts beyond 128k tokens.
Linking neural layers to classical filtering theory opens paths for analyzing stability and convergence in deep networks.

Load-bearing premise

The steady-state assumption that reduces the Kalman filter recurrence to online ridge regression and enables parallelization.

What would settle it

A direct comparison on a long-sequence task where the error covariance is observed to not reach steady state, leading to degraded performance or inability to parallelize training.

Figures

Figures reproduced from arXiv: 2511.21016 by Aditya Chattopadhyay, Elvis Nunez, Liangzu Peng, Luca Zancato, Stefano Soatto, Wei Xia.

**Figure 1.** CH converges with smaller errors than CG and is more numerically stable. Convergence of different methods in residual norms during the forward pass with batch size 8, sequence length 2048, 8 heads, head dimension 128 (a), and relative gradient differences from the exact solver (torch.linalg.solve) to CG (b, c) or CH (d, e). The backward pass is via implicit differentiation (impl) or torch.autograd (auto); …

**Figure 2.** Our GKA block. Blue refers to established practices in the literature, with solid circles denoting ℓ2 normalization. Green components (CH and α-connection) are our proposals.

5.1. GKA on Synthetic Associative Recall Tasks. We first assess the capability of our models to recall information on the Multi-Query Associative Recall (MQAR) task, a synthetic task introduced by Arora et al. [1]. This task presents…

**Figure 3.** (a) MQAR results. Each plot corresponds to a particular sequence length and number of key-value pairs for the model to memorize. (b) Runtime. Runtimes are for a single forward + backward pass (8 heads, head dim 128, batch size 4, averaged over 20 runs).

**Figure 4.** Long Context Performance up to 128k tokens. GKA achieves strong RAG and LongQA capabilities, outperforming all baselines by 10% in relative improvement. Interestingly, we observe that there is no clear winner on Synthetic Recall. All models struggle to perform better than random chance on ICL.

6. Kalman Filter for Optimally Modelling Fading Memory. In this section, we show how the Kalman Filter (KF) provides…

**Figure 5.** (a) The theoretical lower and upper bounds for the values of the divisor …

**Figure 6.** Adaptive regularization results in smoother and better training curves. (a) Plots the training curve for 2.8B models on 100B tokens from DCLM. (b) Plots the corresponding gradient norm. The model with constant regularization (red curve) results in a higher loss that can be attributed to its non-smooth trajectory over the course of its training run (spiky gradient norms).

**Figure 7.** GKA without the α-connection severely underperforms on Synthetic Recall and LongQA. On ICL all SSMs struggle to perform better than random chance (see …

**Figure 8.** Convergence for varying regularization strengths (batch size …

**Figure 9.** Long context performance of GKA for different regularization strengths. The long-context performance of GKA improves initially as we decrease a from 0.1 → 0.05. This is expected since this increases the memorization capacity of the model. However, decreasing further from 0.05 → 0.02 → 0.01 causes performance to decrease. This can be attributed to the increasing condition number of the problem, which reduces t…
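
The trade-off described in the Figure 9 caption can be sketched numerically. The snippet below is an illustration under assumptions, not a reproduction of the paper's analysis: it treats the regularization strength as a generic ridge parameter `lam`, assumes a hypothetical largest eigenvalue of 1 for the unregularized Gram matrix and a worst-case smallest eigenvalue of 0, and uses the standard Chebyshev error bound `2 * sigma**n` with `sigma = (sqrt(kappa) - 1) / (sqrt(kappa) + 1)` to estimate how many iterations a fixed tolerance would need.

```python
import math

# Illustrative sketch only: top_eig, tol, and the worst-case assumption are hypothetical.

def chebyshev_iters(kappa: float, tol: float) -> int:
    """Smallest n such that the bound 2 * sigma**n drops below tol."""
    s = math.sqrt(kappa)
    sigma = (s - 1.0) / (s + 1.0)
    return math.ceil(math.log(2.0 / tol) / math.log(1.0 / sigma))

top_eig = 1.0   # assumed largest eigenvalue of the unregularized Gram matrix
tol = 1e-3      # assumed target reduction of the error

strengths = [0.1, 0.05, 0.02, 0.01]
kappas = [(top_eig + lam) / lam for lam in strengths]   # worst case: smallest eigenvalue is 0
iters = [chebyshev_iters(kappa, tol) for kappa in kappas]

for lam, kappa, n in zip(strengths, kappas, iters):
    print(f"lam={lam:<5} condition number <= {kappa:6.1f}  iterations needed ~ {n}")
```

Under these assumptions the condition number bound grows like `1 / lam`, so a fixed iteration budget leaves a larger solve error as the regularization shrinks, which is consistent with the caption's explanation.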

**Figure 10.** Our Hybrid GKA + Attention model significantly improves performance across all long-context benchmarks compared to our non-Hybrid model. Adding a few Attention layers to our GKA model improves long-range dependency modeling, improving performance across all sequence lengths on RAG, ICL, Synthetic Recall, and Long-QA.

read the original abstract

Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF equations are numerically unstable in low-precision settings (e.g., bfloat16) and hard to parallelize on GPUs. We address this with (1) adaptive regularization via input-dependent gating to control the ridge regression's condition number, and (2) Chebyshev Iteration, which we show is more stable than conventional iterative solvers in low precision. We further develop hardware-aware chunk-wise kernels for efficient training. Empirically, GKA outperforms existing SSM layers (e.g., Mamba2, Gated DeltaNet) on short-context tasks and achieves more than 10% relative improvement on long-context RAG and LongQA up to 128k tokens. We further show GKA outperforms Mamba when extended to ImageNet classification. Our code, including Triton kernels for training and inference (vLLM), along with a model zoo of GKA-based Hybrid models at 8B and 32B scale on HuggingFace, is released under Apache 2.0.
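
The abstract's remark about low-precision instability can be made tangible with a small emulation. The sketch below rounds Python floats to bfloat16 by keeping the top 16 bits of the float32 encoding with round-to-nearest-even; the helper name `to_bf16` is hypothetical, NaN, infinity and overflow are not handled, and real kernels would use hardware bfloat16 rather than this bit trick. It shows a small regularizer being absorbed entirely when added to a larger entry, which is one simple way a ridge term can lose its effect at this precision.

```python
import struct

# Illustrative sketch only: software emulation of bfloat16 rounding (no NaN/inf/overflow handling).

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value, ties to even, and return it as a Python float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]        # float32 bit pattern
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000   # keep the top 16 bits, rounded
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# bfloat16 keeps 8 significant bits, so the spacing just above 1.0 is 2**-7.
print(to_bf16(1.0 + 0.001))      # the small increment is lost
print(to_bf16(1.0 + 2.0 ** -7))  # the smallest increment that survives next to 1.0

# A small ridge term added to a larger diagonal entry disappears after rounding.
diag_entry, lam = 100.0, 0.01
print(to_bf16(diag_entry + lam) == to_bf16(diag_entry))
```

This is only a picture of the rounding behaviour; the paper's own stability analysis of the Kalman recurrence and of the solvers is not reproduced here.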

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GKA keeps full Kalman covariance for SSM updates and reduces it to gated ridge regression under steady state, with practical low-precision fixes and some long-context gains, but the assumption needs checking for non-stationary sequences.

The main point is that this paper takes the Kalman filter recurrence seriously for state-space layers instead of defaulting to the identity-covariance shortcuts used in DeltaNet and Gated DeltaNet. It keeps the full error covariance, computes the exact gain, and then invokes a steady-state assumption so the whole thing collapses to online ridge regression with constant memory and linear cost. They add input-dependent gating to keep the condition number under control and swap in Chebyshev iteration for better low-precision behavior, plus chunked Triton kernels for training.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Gated KalmaNet (GKA), a linear state-space model layer derived from the Kalman filter. It maintains the full error covariance and computes the exact Kalman gain, in contrast to prior SSM layers (DeltaNet, Gated DeltaNet) that assume identity covariance. Under a steady-state assumption the recurrence reduces to online ridge regression, enabling constant memory and linear compute. Numerical stability in low precision is addressed via input-dependent gating for adaptive regularization and Chebyshev iteration; hardware-aware chunked kernels are provided for training. Experiments report gains over Mamba2 and Gated DeltaNet on short-context tasks and >10% relative improvement on long-context RAG and LongQA up to 128k tokens, plus better ImageNet classification when replacing Mamba.

Significance. If the central reduction and stability fixes hold, the work supplies a principled mechanism for incorporating full posterior covariance into SSMs without sacrificing the efficiency that makes them attractive for long contexts. The public release of Triton kernels, vLLM integration, and 8B/32B hybrid model weights strengthens reproducibility. The approach directly targets the fading-memory limitation of existing SSMs on recall-oriented tasks.

major comments (2)

[§3, §4] §3 (Kalman filter derivation) and §4 (steady-state reduction): the claim that GKA computes the exact Kalman gain from the full covariance is load-bearing for the optimality argument, yet the manuscript provides no quantitative bound or empirical measurement of the approximation error introduced by the steady-state covariance assumption on non-stationary sequences. In long-context RAG/LongQA settings the token statistics are typically non-stationary; without such a bound it is unclear whether the fixed-gain ridge-regression form retains the claimed advantage over identity-covariance baselines.
[§5] §5 (experiments): the reported >10% relative gains on LongQA and RAG lack error bars, ablation isolating the steady-state step, and sensitivity analysis to the gating parameters that control the ridge condition number. Without these controls it is difficult to attribute improvements specifically to the full-covariance mechanism rather than to the adaptive regularization or kernel implementation.

minor comments (2)

[§2] Notation for the error covariance matrix and the input-dependent gating function should be introduced with explicit dimensions and initialization details in the main text rather than only in the appendix.
[Figure 2] Figure 2 (condition-number plots) would benefit from a direct comparison against the unregularized Kalman filter in bfloat16 to quantify the stability gain of the proposed Chebyshev solver.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of our claims and experimental rigor. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Referee: [§3, §4] §3 (Kalman filter derivation) and §4 (steady-state reduction): the claim that GKA computes the exact Kalman gain from the full covariance is load-bearing for the optimality argument, yet the manuscript provides no quantitative bound or empirical measurement of the approximation error introduced by the steady-state covariance assumption on non-stationary sequences. In long-context RAG/LongQA settings the token statistics are typically non-stationary; without such a bound it is unclear whether the fixed-gain ridge-regression form retains the claimed advantage over identity-covariance baselines.

Authors: We agree that the steady-state assumption is an approximation whose error is not quantified in the current manuscript, and that this weakens the optimality argument for non-stationary data. The reduction to online ridge regression remains a principled incorporation of covariance structure that is absent from identity-covariance baselines, but a bound or measurement would strengthen the presentation. In the revision we will add an empirical analysis on synthetic non-stationary sequences that measures the deviation from the time-varying Kalman filter and compares against identity-covariance variants under controlled degrees of non-stationarity. revision: yes
Referee: [§5] §5 (experiments): the reported >10% relative gains on LongQA and RAG lack error bars, ablation isolating the steady-state step, and sensitivity analysis to the gating parameters that control the ridge condition number. Without these controls it is difficult to attribute improvements specifically to the full-covariance mechanism rather than to the adaptive regularization or kernel implementation.

Authors: We accept that the current experimental section would benefit from additional controls. The revised manuscript will include error bars computed over multiple random seeds for the long-context tasks and a sensitivity study sweeping the gating parameters that affect the ridge condition number. An ablation that fully isolates the steady-state assumption by running the exact time-varying Kalman filter is not feasible at 128k context length; we will instead provide such an ablation on shorter sequences where the full filter remains tractable and discuss the computational barrier for longer contexts. revision: partial

standing simulated objections not resolved

Direct empirical ablation of the steady-state assumption against the exact time-varying Kalman filter on sequences of length 128k, which would require quadratic memory and compute.

Circularity Check

0 steps flagged

No circularity: derivation reduces KF to ridge regression under explicit steady-state assumption

full rationale

The provided abstract and context present a standard mathematical reduction: full-covariance KF yields exact Kalman gain, which under a stated steady-state assumption simplifies to online ridge regression for parallelization and constant memory. This is an approximation justified for efficiency, not a self-definition or fitted input renamed as prediction. Adaptive regularization via gating is introduced to address numerical instability in low precision, not to force the core result. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work are evident in the given text. The central claim remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on the standard Kalman filter recurrence and the steady-state assumption that converts it to ridge regression; the adaptive gating introduces input-dependent regularization whose exact functional form and fitting procedure are not detailed in the abstract.

free parameters (1)

input-dependent gating parameters
Control the ridge regression condition number; their precise parameterization and training procedure are not specified in the abstract.

axioms (2)

standard math: Kalman filter recurrence equations under identity process noise or measurement models
Invoked to derive the exact gain and to show prior SSMs as approximations.
domain assumption: Steady-state assumption on the error covariance
Required to obtain the parallelizable online ridge regression form.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Relation between the paper passage and the cited Recognition theorem.

Under a steady-state assumption that enables parallelization, this reduces to solving an online ridge regression problem with constant memory and linear compute cost.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Relation between the paper passage and the cited Recognition theorem.

GKA maintains the full error covariance and computes the exact Kalman gain

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
cs.LG 2026-04 unverdicted novelty 7.0
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG 2026-05 unverdicted novelty 6.0
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 2 Pith papers · 15 internal anchors

[1] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2023.
[2] Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, and Carole-Jean Wu. Hybrid architectures for language models: Systematic analysis and design insights. arXiv preprint arXiv:2510.04800,
[3] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547–107603, 2024.
[4] Sneha Chaudhari, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. An attentive survey of attention models. ACM Transactions on Intelligent Systems and Technology, 12(5):1–32, 2021.
[5] Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585,
[6] Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024.
[7] Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, et al. Curie: Evaluating llms on multitask scientific long context understanding and reasoning. arXiv preprint arXiv:2503.13517, 2025.
[8] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978–2988, 2019.
[9] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning,
[10] Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
[11] Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
[12] Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models. arXiv preprint arXiv:2411.13676, 2024.
[13] Hanpei Fang, Sijie Tao, Nuo Chen, Kai-Xin Chang, and Tetsuya Sakai. Do large language models favor recent content? a study on recency bias in llm-based reranking. arXiv preprint arXiv:2509.11353, 2025.
[14] Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling? arXiv preprint arXiv:2410.23771, 2024.
[15] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The lan...
[16] Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7376–7399, 2025.
[17] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712, 2024.
[18] Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, et al. Is flash attention stable? arXiv preprint arXiv:2405.02803, 2024.
[19] Gene H Golub and Charles F Van Loan. Matrix Computations (4th ed.). The Johns Hopkins University Press, 2013.
[20] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[21] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. CoRR, abs/2111.00396, 2021.
[22] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
[23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[24] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems, 35:30016–30030, 2022.
[25] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
[26] Samy Jelassi, David Brandfonbrener, Sham M Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying. arXiv preprint arXiv:2402.01032, 2024.
[27] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
[28] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45,
[29] Gregory Kamradt. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
[30] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
[31] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[32] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Advances in Neural Information Processing Systems, 37:14200–14282, 2024.
[33] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024.
[34]

Longhorn: State space models are amortized online learners

Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space models are amortized online learners. InInternational Conference on Learning Representations, 2025. 1, 2

work page 2025
[35]

Error propagation properties of recursive least-squares adaptation algorithms.Automatica, 21(2):157–167, 1985

Stefan Ljung and Lennart Ljung. Error propagation properties of recursive least-squares adaptation algorithms.Automatica, 21(2):157–167, 1985. 15

work page 1985
[36]

RanPAC: Random projections and pre-trained models for continual learning

Mark D McDonnell, Dong Gong, Amin Parvaneh, Ehsan Abbasnejad, and Anton van den Hengel. RanPAC: Random projections and pre-trained models for continual learning. Advances in Neural Information Processing Systems, 2023. 3

work page 2023
[37]

Landmark attention: Random-access infinite context length for transformers

Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023. 15
[38]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 101, 2024. 15
[39]

On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964

Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964. 2
[40]

Expansion span: Combining fading memory and retrieval in hybrid state space models

Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, and Stefano Soatto. Expansion span: Combining fading memory and retrieval in hybrid state space models. arXiv preprint arXiv:2412.13328, 2024. 8, 15
[41]

Resurrecting recurrent neural networks for long sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023. 15
[42]

Marconi: Prefix caching for the era of hybrid llms. arXiv preprint arXiv:2411.19379, 2024

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid llms. arXiv preprint arXiv:2411.19379, 2024. 15
[43]

Residual polynomials and the Chebyshev method

Fabian Pedregosa. Residual polynomials and the Chebyshev method. http://fa.bianp.net/blog/2020/polyopt/, 2020. 4
[44]

Eagle and finch: RWKV with matrix-valued states and dynamic recurrence

Bo Peng, Daniel Goldstein, Quentin Gregory Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Kranthi Kiran GV, Haowen Hou, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Jian Zhu, and Rui-Jie Zhu. Eagle and ...
[45]

RWKV-7 "Goose" with expressive dynamic state evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. RWKV-7 "Goose" with expressive dynamic state evolution. Technical report, arXiv preprint arXiv:2503.14456, 2025. 2
[46]

Mathematics of continual learning

Liangzu Peng and René Vidal. Mathematics of continual learning. Technical report, arXiv:2504.17963 [cs.LG], 2025. 3

[47]

TSVD: Bridging theory and practice in continual learning with pre-trained models

Liangzu Peng, Juan Elenter, Joshua Agterberg, Alejandro Ribeiro, and Rene Vidal. TSVD: Bridging theory and practice in continual learning with pre-trained models. In International Conference on Learning Representations, 2025. 3
[48]

Kilt: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. Kilt: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...
[49]

Fundamentals of Adaptive Filtering

A.H. Sayed. Fundamentals of Adaptive Filtering. Wiley,
[50]

Adaptive Filters

A.H. Sayed. Adaptive Filters. Wiley, 2011. 15
[51]

Linear transformers are secretly fast weight programmers

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International conference on machine learning, 2021. 2
[52]

Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks, 5(5):779–787, 1992

Samir Shah, Francesco Palmieri, and Michael Datum. Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks, 5(5):779–787, 1992. 3
[53]

Deltaproduct: Improving state-tracking in linear rnns via householder products

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Improving state-tracking in linear rnns via householder products. Technical report, arXiv:2502.10297v6 [cs.LG], 2025. 2
[55]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023. 15
[56]

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025. 11
[57]

Attention is all you need. Advances in Neural Information Processing Systems, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017. 1, 2, 15
[58]

Attention: Self-expression is all you need, 2022

René Vidal. Attention: Self-expression is all you need, 2022. 2

[59]

Mesanet: Sequence modeling by locally optimal test-time training

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet: Sequence modeling by locally optimal test-time training. Technical report, arXiv:2506.05233 [cs.LG],
[60]

An Empirical Study of Mamba-based Language Models

Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887, 2024. 15
[61]

Test-time regression: a unifying framework for designing sequence models with associative memory

Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. Technical report, arXiv:2501.12352v3 [cs.LG], 2025. 3
[62]

Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372,

Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372,
[63]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 6, 29
[64]

Gated linear attention transformers with hardware-efficient training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, 2024. 1, 2, 4, 6, 7, 15
[65]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Advances in Neural Information Processing Systems, pages 115491–115522. Curran Associates, Inc., 2024. 10, 15
[66]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Neural Information Processing Systems, 2024. 1, 2, 3, 4, 6
[67]

Gated delta networks: Improving mamba2 with delta rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. In International Conference on Learning Representations, 2025. 2, 6, 7, 8, 10, 15
[68]

HELMET: How to evaluate long-context language models effectively and thoroughly

Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. HELMET: How to evaluate long-context language models effectively and thoroughly. In International Conference on Learning Representations (ICLR), 2025. 2, 8, 25
[69]

Native sparse attention: Hardware-aligned and natively trainable sparse attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025. 15
[70]

Stacked residuals of dynamic layers for time series anomaly detection. arXiv preprint arXiv:2202.12457, 2022

Luca Zancato, Alessandro Achille, Giovanni Paolini, Alessandro Chiuso, and Stefano Soatto. Stacked residuals of dynamic layers for time series anomaly detection. arXiv preprint arXiv:2202.12457, 2022. 15
[71]

B'mojo: Hybrid state space realizations of foundation models with eidetic and fading memory

Luca Zancato, Arjun Seshadri, Yonatan Dukler, Aditya Golatkar, Yantao Shen, Benjamin Bowman, Matthew Trager, Alessandro Achille, and Stefano Soatto. B'mojo: Hybrid state space realizations of foundation models with eidetic and fading memory. In Advances in Neural Information Processing Systems, pages 130433–130462. Curran Associates, Inc.,
[72]

Continual learning of context-dependent processing in neural networks

Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019. 3

A. Related Work

Since the introduction of Self-Attention [57], significant research has been conducted to reduce its quadratic cost in processing long input sequences. As models and syste...
[73]

fades away the past

or Linear Time-Invariant dynamical systems [22, 70], to those introducing novel adaptive or gated state updates [9, 41, 64]. Despite their differences, all SSMs follow the same basic working principle inspired by classical state-space models [28]: they process the input sequence by maintaining a fixed-size state that acts as a compressed (lossy) representat...
[74]

Thus $\omega_1$ lies in ($\omega_1^*$, $\omega^*$

where $g(\omega)$ decreases, therefore we have $\omega_1 \le \omega_0$. Thus $\omega_1$ lies in ($\omega_1^*$, $\omega^*$
[75]

We could then conclude inductively that $\omega_1^* < \omega_i \le \omega_{i-1}$ for all $i = 1,$

again. We could then conclude inductively that $\omega_1^* < \omega_i \le \omega_{i-1}$ for all $i = 1, \dots, r$. From Lemma 3 we know that the update of $\omega_i$ in (weight schedule) would not create much numerical concern in a forward pass, as we have $\omega_i \in [1, 2]$ for all $i$. Furthermore, we can bound the rate at which $\omega_i$ converges to $\omega_1^*$:

Lemma 4. Define $\kappa := L/\mu$. For any $i = 1, \dots, r$, ...
[76]

The proof is concluded by unrolling the above recurrence

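$$\cdots\,(\omega_{i-1} - \omega_1^*) \overset{(i)}{=} \frac{\rho^2 \omega_1^*}{4 - \rho^2 \omega_{i-1}}\,(\omega_{i-1} - \omega_1^*) \overset{(ii)}{\le} \frac{\rho^2 \omega_{i-1} \omega_1^*}{4}\,(\omega_{i-1} - \omega_1^*) \overset{(iii)}{\le} \left(1 - \sqrt{1 - \rho^2}\right)(\omega_{i-1} - \omega_1^*) \overset{(iv)}{=} \frac{\kappa - 1}{\kappa + 1} \cdot \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\,(\omega_{i-1} - \omega_1^*)$$

Here, (i) follows from the fact that $\omega_1^*$ is a fixed point, (ii) follows from Lemma 3 that $\omega_i \le \omega_{i-1}$, (iii) follows from the definition of $\omega_1^*$ and the fact $\omega_{i-1} \le 2$, and (iv) follows from...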
[77]

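$\langle k_l, x_t \rangle \frac{dL}{dq_t} + \left\langle k_l, \frac{dL}{dq_t} \right\rangle x_t + \frac{2a}{\|H_t\|_F} \left\langle x_t, \frac{dL}{dq_t} \right\rangle H_t k_l \Big]$ (50)

Or equivalently, collecting the terms that are linear in the gradient: $\frac{dL}{dk_l} = -\sum_{t \ge l} \Big( \prod_{i=l}^{t-1} \gamma_i \Big)$

First-order methods for solving $H\xi = q$ converge at most at a rate $R_a := \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$, and we see $\omega_i$ converges at an even faster rate. Numerically, assuming $\kappa = \frac{L}{\mu} = \frac{1.02}{0.02} = 51$, we then have:

$R \approx 0.7253$, $R^5 \approx 0.2$, $R^{10} \approx 0.04$, $R^{20} \approx 0.0016$, $R^{30} \approx 6 \times 10^{-5}$

$R_a \approx 0.7543$, $R_a^5 \approx 0.244$, $R_a^{10} \approx 0.0597$, $R_a^{20} \approx 0.0036$, $R_a^{30} \approx 0.0002$.

Thus, with $\kappa = 51$, the update of $\omega_i$...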
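The figures quoted for κ = 51 can be re-derived with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the recurrence ω_i = 4 / (4 − ρ² ω_{i−1}) is inferred from step (i) of the proof of Lemma 4, ρ = (κ − 1)/(κ + 1) is assumed because it makes step (iv) hold, and the starting value ω_0 = 2 is borrowed from the standard Chebyshev iteration.

```python
import math


def convergence_rates(kappa):
    """Return (R, R_a) for the condition number kappa = L / mu."""
    sqrt_kappa = math.sqrt(kappa)
    # R_a: rate of first-order methods for solving H xi = q.
    r_a = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
    # R: contraction factor bounding how fast omega_i approaches omega_1^*.
    r = (kappa - 1.0) / (kappa + 1.0) * r_a
    return r, r_a


def weight_schedule(kappa, steps, omega0=2.0):
    """Iterate omega_i = 4 / (4 - rho^2 * omega_{i-1}) (assumed form)."""
    rho = (kappa - 1.0) / (kappa + 1.0)  # assumption, consistent with step (iv)
    omegas = [omega0]  # omega0 = 2 is an assumption
    for _ in range(steps):
        omegas.append(4.0 / (4.0 - rho ** 2 * omegas[-1]))
    # Smaller fixed point of the recurrence, playing the role of omega_1^*.
    omega_star = 2.0 * (1.0 - math.sqrt(1.0 - rho ** 2)) / rho ** 2
    return omegas, omega_star


kappa = 1.02 / 0.02  # the value used in the text, approximately 51
R, R_a = convergence_rates(kappa)
for power in (1, 5, 10, 20, 30):
    print(f"R^{power} = {R ** power:.2e}   R_a^{power} = {R_a ** power:.2e}")

omegas, omega_star = weight_schedule(kappa, steps=10)
print("omega_i:", [round(w, 4) for w in omegas], "omega_1^*:", round(omega_star, 4))
```

Under these assumptions the iterates stay in [1, 2], decrease monotonically toward the fixed point, and contract by at most R per step, which is the behaviour Lemma 3 and Lemma 4 describe.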

[78]

This choice of 0.25 is motivated by concurrent work [59], which explored a similar ridge regression objective for LLM training

Constant regularization. We train the same model architecture (as above) with $\lambda_t = 0.25$ (a constant). This choice of 0.25 is motivated by concurrent work [59], which explored a similar ridge regression objective for LLM training. As shown in Fig. 6, without strict condition number control, gradient norms spike during training, leading to increased cross entrop...
[79]

We train a model with adaptive weights

Adaptive weighting (gating). We train a model with adaptive weights. Specifically, for all $t \ge i$, we parameterize the weight for the $i$-th sample at time-step $t$ as $\eta_{t,i} = \prod_{j=i+1}^{t} \gamma_j$, with each $\gamma_j \in [0, 1]$ learnable
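The product form of the weights can be made concrete with a short sketch. It is illustrative only: the function name `gating_weights` and the 0-based indexing are assumptions, and in the model the gates would be learned, input-dependent values rather than a fixed list.

```python
def gating_weights(gammas):
    """Return eta with eta[t][i] = product of gammas[j] for j = i+1..t.

    gammas[j] is the gate at time-step j (0-based, an assumption of this
    sketch); gammas[0] never enters any product.
    """
    eta = []
    for t in range(len(gammas)):
        row = [1.0] * (t + 1)  # eta[t][t] = 1, the empty product
        for i in range(t - 1, -1, -1):
            # Extend the product one step further into the past.
            row[i] = row[i + 1] * gammas[i + 1]
        eta.append(row)
    return eta


gammas = [1.0, 0.5, 0.5, 0.8]
for t, row in enumerate(gating_weights(gammas)):
    print(t, row)
```

Each new gate rescales every older weight, eta[t][i] = gammas[t] * eta[t-1][i], which is what makes the memory fade. Setting every gate to 1 gives all weights equal to 1, the unweighted case used in the "No weighting" ablation.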

[80]

complexity

No weighting. We train the same model architecture as above, but with no weights. This essentially results in an unweighted ridge regression objective obtained by setting $\eta_i = 1$ in (3). Table 6 shows clear benefits of adaptive weighting, with improvements across the board on all LM-Harness tasks considered, thereby validating our hypothesis. Table 6. Adaptiv...

[31] [31]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 15

work page 2023

[32] [32]

Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024. 7

work page 2024

[33] [33]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hy- brid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024. 15

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Longhorn: State space models are amortized online learners

Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space models are amortized online learners. InInternational Conference on Learning Representations, 2025. 1, 2

work page 2025

[35] [35]

Error propagation properties of recursive least-squares adaptation algorithms.Automatica, 21(2):157–167, 1985

Stefan Ljung and Lennart Ljung. Error propagation properties of recursive least-squares adaptation algorithms.Automatica, 21(2):157–167, 1985. 15

work page 1985

[36] [36]

RanPAC: Random projections and pre-trained models for continual learning

Mark D McDonnell, Dong Gong, Amin Parvaneh, Ehsan Abbasnejad, and Anton van den Hengel. RanPAC: Random projections and pre-trained models for continual learning. Advances in Neural Information Processing Systems, 2023. 3

work page 2023

[37] [37]

Landmark attention: Random-access infinite context length for transformers

Amirkeivan Mohtashami and Martin Jaggi. Landmark atten- tion: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023. 15

work page arXiv 2023

[38] [38]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite con- text transformers with infini-attention.arXiv preprint arXiv:2404.07143, 101, 2024. 15

work page internal anchor Pith review arXiv 2024

[39] [39]

On estimating regression.Theory of Probability & Its Applications, 9(1):141–142, 1964

Elizbar A Nadaraya. On estimating regression.Theory of Probability & Its Applications, 9(1):141–142, 1964. 2

work page 1964

[40] [40]

Expansion span: Combin- ing fading memory and retrieval in hybrid state space models

Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Go- latkar, Wei Xia, and Stefano Soatto. Expansion span: Combin- ing fading memory and retrieval in hybrid state space models. arXiv preprint arXiv:2412.13328, 2024. 8, 15

work page arXiv 2024

[41] [41]

Resurrecting recurrent neural networks for long sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fer- nando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. InInternational Conference on Machine Learning, pages 26670–26698. PMLR, 2023. 15

work page 2023

[42] [42]

Marconi: Prefix caching for the era of hybrid llms.arXiv preprint arXiv:2411.19379, 2024

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zan- cato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid llms.arXiv preprint arXiv:2411.19379, 2024. 15

work page arXiv 2024

[43] [43]

Residual polynomials and the Cheby- shev method

Fabian Pedregosa. Residual polynomials and the Cheby- shev method. http://fa.bianp.net/blog/2020/ polyopt/, 2020. 4

work page 2020

[44] [44]

Eagle and finch: RWKV with matrix-valued states and dynamic recurrence

Bo Peng, Daniel Goldstein, Quentin Gregory Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Kranthi Kiran GV , Haowen Hou, Satyapriya Kr- ishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Jian Zhu, and Rui-Jie Zhu. Eagle and ...

work page

[45] [45]

Rwkv-7" goose" with expressive dynamic state evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. Rwkv-7" goose" with expressive dynamic state evolution. Technical report, arXiv preprint arXiv:2503.14456, 2025. 2

work page arXiv 2025

[46] [46]

Mathematics of continual learning

Liangzu Peng and René Vidal. Mathematics of continual learning. Technical report, arXiv:2504.17963 [cs.LG], 2025. 3

work page arXiv 2025

[47] [47]

TSVD: Bridging theory and practice in continual learning with pre-trained models

Liangzu Peng, Juan Elenter, Joshua Agterberg, Alejandro Ribeiro, and Rene Vidal. TSVD: Bridging theory and practice in continual learning with pre-trained models. InInternational Conference on Learning Representations, 2025. 3

work page 2025

[48] [48]

Kilt: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. Kilt: a benchmark for knowledge intensive language tasks. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech...

work page 2021

[49] [49]

Sayed.Fundamentals of Adaptive Filtering

A.H. Sayed.Fundamentals of Adaptive Filtering. Wiley,

work page

[50] [50]

Sayed.Adaptive Filters

A.H. Sayed.Adaptive Filters. Wiley, 2011. 15

work page 2011

[51] [51]

Lin- ear transformers are secretly fast weight programmers

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Lin- ear transformers are secretly fast weight programmers. In International conference on machine learning, 2021. 2

work page 2021

[52] [52]

Opti- mal filtering algorithms for fast learning in feedforward neural networks.Neural Networks, 5(5):779–787, 1992

Samir Shah, Francesco Palmieri, and Michael Datum. Opti- mal filtering algorithms for fast learning in feedforward neural networks.Neural Networks, 5(5):779–787, 1992. 3

work page 1992

[53] [53]

Deltaproduct: Im- proving state-tracking in linear rnns via householder products

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Im- proving state-tracking in linear rnns via householder products. Technical report, arXiv:2502.10297v6 [cs.LG], 2025. 2

work page arXiv 2025

[54] [55]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive net- work: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023. 15

work page internal anchor Pith review Pith/arXiv arXiv 2023

[55] [56]

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient at- tention architecture.arXiv preprint arXiv:2510.26692, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [57]

Attention is all you need.Advances in Neural Information Processing Systems, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 2017. 1, 2, 15

work page 2017

[57] [58]

Attention: Self-expression is all you need, 2022

René Vidal. Attention: Self-expression is all you need, 2022. 2

work page 2022

[58] [59]

MesaNet: Sequence modeling by locally optimal test-time training

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. MesaNet: Sequence modeling by locally optimal test-time training. Technical report, arXiv:2506.05233 [cs.LG],

work page arXiv

[59] [60]

An Empirical Study of Mamba-based Language Models

Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of Mamba-based language models. arXiv preprint arXiv:2406.07887, 2024. 15

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [61]

Test-time regression: a unifying framework for designing sequence models with associative memory

Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. Technical report, arXiv:2501.12352v3 [cs.LG], 2025. 3

work page arXiv 2025

[61] [62]

Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372,

Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372,

work page

[62] [63]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 6, 29

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [64]

Gated linear attention transformers with hardware-efficient training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, 2024. 1, 2, 4, 6, 7, 15

work page 2024

[64] [65]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Advances in Neural Information Processing Systems, pages 115491–115522. Curran Associates, Inc., 2024. 10, 15

work page 2024

[65] [66]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Neural Information Processing Systems, 2024. 1, 2, 3, 4, 6

work page 2024

[66] [67]

Gated delta networks: Improving mamba2 with delta rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. In International Conference on Learning Representations, 2025. 2, 6, 7, 8, 10, 15

work page 2025

[67] [68]

HELMET: How to evaluate long-context language models effectively and thoroughly

Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. HELMET: How to evaluate long-context language models effectively and thoroughly. In International Conference on Learning Representations (ICLR), 2025. 2, 8, 25

work page 2025

[68] [69]

Native sparse attention: Hardware-aligned and natively trainable sparse attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025. 15

work page 2025

[69] [70]

Stacked residuals of dynamic layers for time series anomaly detection. arXiv preprint arXiv:2202.12457, 2022

Luca Zancato, Alessandro Achille, Giovanni Paolini, Alessandro Chiuso, and Stefano Soatto. Stacked residuals of dynamic layers for time series anomaly detection. arXiv preprint arXiv:2202.12457, 2022. 15

work page arXiv 2022

[70] [71]

B'mojo: Hybrid state space realizations of foundation models with eidetic and fading memory

Luca Zancato, Arjun Seshadri, Yonatan Dukler, Aditya Golatkar, Yantao Shen, Benjamin Bowman, Matthew Trager, Alessandro Achille, and Stefano Soatto. B'mojo: Hybrid state space realizations of foundation models with eidetic and fading memory. In Advances in Neural Information Processing Systems, pages 130433–130462. Curran Associates, Inc.,

work page

[71] [72]

Continual learning of context-dependent processing in neural networks

Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019. 3

A. Related Work

Since the introduction of Self-Attention [57], significant research has been conducted to reduce its quadratic cost in processing long input sequences. As models and syste...

work page 2019

[72] [73]

fades away the past

or Linear Time-Invariant dynamical systems [22, 70], to those introducing novel adaptive or gated state updates [9, 41, 64]. Despite their differences, all SSMs follow the same basic working principle inspired by classical state-space models [28]: they process the input sequence by maintaining a fixed-size state that acts as a compressed (lossy) representat...

work page

[73] [74]

Thus $\omega_1$ lies in $(\omega_1^*, \omega^*$

where $g(\omega)$ decreases, therefore we have $\omega_1 \le \omega_0$. Thus $\omega_1$ lies in $(\omega_1^*, \omega^*$

work page

[74] [75]

We could then conclude inductively that $\omega_1^* < \omega_i \le \omega_{i-1}$ for all $i = 1,$

again. We could then conclude inductively that $\omega_1^* < \omega_i \le \omega_{i-1}$ for all $i = 1, \dots, r$. From Lemma 3 we know that the update of $\omega_i$ in (weight schedule) would not create much numerical concern in a forward pass, as we have $\omega_i \in [1, 2]$ for all $i$. Furthermore, we can bound the rate at which $\omega_i$ converges to $\omega_1^*$: Lemma 4. Define $\kappa := L/\mu$. For any $i = 1, \dots, r$, ...

work page

[75] [76]

The proof is concluded by unrolling the above recurrence

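$$
\begin{aligned}
{}\cdot (\omega_{i-1} - \omega_1^*)
&\overset{(i)}{=} \frac{\rho^2 \omega_1^*}{4 - \rho^2 \omega_{i-1}} \cdot (\omega_{i-1} - \omega_1^*) \\
&\overset{(ii)}{\le} \frac{\rho^2 \omega_{i-1} \omega_1^*}{4} \cdot (\omega_{i-1} - \omega_1^*) \\
&\overset{(iii)}{\le} \left(1 - \sqrt{1 - \rho^2}\right) \cdot (\omega_{i-1} - \omega_1^*) \\
&\overset{(iv)}{=} \frac{\kappa - 1}{\kappa + 1} \cdot \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \cdot (\omega_{i-1} - \omega_1^*)
\end{aligned}
$$

Here, (i) follows from the fact that $\omega_1^*$ is a fixed point, (ii) follows from Lemma 3 that $\omega_i \le \omega_{i-1}$, (iii) follows from the definition of $\omega_1^*$ and the fact $\omega_{i-1} \le 2$, and (iv) follows from...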
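
The chain of inequalities above can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes the weight schedule is the recurrence $\omega_i = 4 / (4 - \rho^2 \omega_{i-1})$ with $\omega_0 = 2$ and $\rho = (\kappa - 1)/(\kappa + 1)$, a form that is consistent with step (i) and with the fixed-point property of $\omega_1^*$ but is not spelled out in this excerpt. The function names are hypothetical.

```python
import math

def omega_star(rho):
    # Smaller fixed point of omega = 4 / (4 - rho^2 * omega).
    return 2.0 / (1.0 + math.sqrt(1.0 - rho * rho))

def weight_schedule(kappa, r, omega0=2.0):
    # ASSUMED recurrence (not given in the excerpt):
    #   omega_i = 4 / (4 - rho^2 * omega_{i-1}),  rho = (kappa - 1) / (kappa + 1)
    rho = (kappa - 1.0) / (kappa + 1.0)
    omegas = [omega0]
    for _ in range(r):
        omegas.append(4.0 / (4.0 - rho * rho * omegas[-1]))
    return rho, omegas

kappa = 51.0
rho, omegas = weight_schedule(kappa, r=30)
target = omega_star(rho)
bound = 1.0 - math.sqrt(1.0 - rho * rho)  # right-hand side of step (iii)

# Per-step contraction factor (omega_i - omega*) / (omega_{i-1} - omega*).
ratios = [
    (omegas[i] - target) / (omegas[i - 1] - target)
    for i in range(1, len(omegas))
    if omegas[i - 1] - target > 1e-12
]
print(f"omega* = {target:.6f}, bound = {bound:.4f}, max ratio = {max(ratios):.4f}")
```

Under these assumptions every observed contraction factor stays below the bound from step (iii), and the iterates remain in $[1, 2]$ and decrease monotonically, as Lemma 3 states.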

work page

[76] [77]

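$\langle k_l, x_t \rangle \frac{dL}{dq_t} + \left\langle k_l, \frac{dL}{dq_t} \right\rangle x_t + \frac{2a}{\|H_t\|_F} \left\langle x_t, \frac{dL}{dq_t} \right\rangle H_t k_l \Bigg]$ (50) Or equivalently, collecting the terms that are linear in the gradient: $\frac{dL}{dk_l} = -\sum_{t \ge l} \Bigg( \prod_{i=l}^{t-1} \gamma_i \Bigg)$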

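First-order methods for solving $H\xi = q$ converge at most at a rate $R_a := \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$, and we see $\omega_i$ converges at an even faster rate. Numerically, assuming $\kappa = \frac{L}{\mu} = \frac{1.02}{0.02} = 51$, we then have:

$R \approx 0.7253$, $R^5 \approx 0.2$, $R^{10} \approx 0.04$, $R^{20} \approx 0.0016$, $R^{30} \approx 6 \times 10^{-5}$

$R_a \approx 0.7543$, $R_a^5 \approx 0.244$, $R_a^{10} \approx 0.0597$, $R_a^{20} \approx 0.0036$, $R_a^{30} \approx 0.0002$.

Thus, with $\kappa = 51$, the update of $\omega_i$...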
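
The quoted figures can be reproduced with a few lines of arithmetic. This sketch assumes that $R$ denotes the product $\frac{\kappa - 1}{\kappa + 1} \cdot \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ from the bound on $\omega_i$, which matches the value $R \approx 0.7253$ but is not restated in this excerpt. The function name `contraction_rates` is illustrative only.

```python
import math

def contraction_rates(L, mu):
    kappa = L / mu
    sqrt_k = math.sqrt(kappa)
    r_a = (sqrt_k - 1.0) / (sqrt_k + 1.0)      # first-order rate R_a
    r = (kappa - 1.0) / (kappa + 1.0) * r_a    # ASSUMED definition of R
    return kappa, r, r_a

kappa, r, r_a = contraction_rates(L=1.02, mu=0.02)
for p in (1, 5, 10, 20, 30):
    print(f"p={p:2d}  R^p={r**p:.2e}  R_a^p={r_a**p:.2e}")
```

Since $R = \frac{\kappa - 1}{\kappa + 1} R_a$ and the first factor is below one, $R < R_a$ for every finite $\kappa > 1$, and the gap compounds with the number of iterations.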

work page arXiv 2048

[77] [78]

This choice of 0.25 is motivated by concurrent work [59], which explored a similar ridge regression objective for LLM training

Constant regularization. We train the same model architecture (as above) with $\lambda_t = 0.25$ (a constant). This choice of 0.25 is motivated by concurrent work [59], which explored a similar ridge regression objective for LLM training. As shown in Fig. 6, without strict condition number control, gradient norms spike during training, leading to increased cross entrop...

work page

[78] [79]

We train a model with adaptive weights

Adaptive weighting (gating). We train a model with adaptive weights. Specifically, for all $t \ge i$, we parameterize the weight for the $i$-th sample at time-step $t$ as $\eta_{t,i} = \prod_{j=i+1}^{t} \gamma_j$, with each $\gamma_j \in [0, 1]$ learnable
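
To make the indexing of the gating weights concrete, the following sketch computes $\eta_{t,1}, \dots, \eta_{t,t}$ from a list of gates. It is a plain-Python illustration under the stated parameterization only; the function name `decay_weights` and the example gate values are hypothetical, and a practical implementation would more likely work with cumulative sums of log-gates on tensors.

```python
def decay_weights(gammas, t):
    """Return [eta_{t,1}, ..., eta_{t,t}] with eta_{t,i} = prod_{j=i+1}^{t} gamma_j.

    gammas[j - 1] holds gamma_j (1-indexed in the text, 0-indexed here).
    """
    etas = [1.0] * t          # eta_{t,t} is the empty product, i.e. 1
    running = 1.0
    # Walk backwards: eta_{t,i} = gamma_{i+1} * eta_{t,i+1}.
    for i in range(t - 1, 0, -1):
        running *= gammas[i]  # gammas[i] is gamma_{i+1}
        etas[i - 1] = running
    return etas

gammas = [0.9, 0.5, 0.8, 1.0]   # hypothetical gate values in [0, 1]
etas_t3 = decay_weights(gammas, t=3)
etas_t4 = decay_weights(gammas, t=4)
print(etas_t3, etas_t4)
```

Two properties follow directly from the product form: the newest sample always has weight one, and moving from step $t - 1$ to step $t$ rescales every older weight by $\gamma_t$, so older samples fade geometrically whenever the gates are below one.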

work page

[79] [80]

complexity

No weighting. We train the same model architecture as above, but with no weights. This essentially results in an unweighted ridge regression objective obtained by setting $\eta_i = 1$ in (3). Table 6 shows clear benefits of adaptive weighting, with improvements across the board on all LM-Harness tasks considered, thereby validating our hypothesis. Table 6. Adaptiv...

work page arXiv 2048