Lifelong In-Context Learning with Transformers Requires Parametric Forms of Attention

Luke McDermott; Rahul Parhi; Robert W. Heath Jr.

arxiv: 2606.25342 · v1 · pith:44T3NNESnew · submitted 2026-06-24 · 💻 cs.LG

Pith reviewed 2026-06-25 21:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords lifelong in-context learning · parametric attention · transformers · continual learning · key-value cache · online regression · state-space models

The pith

Transformers require parametric attention to perform lifelong in-context learning on a fixed memory budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard transformers cannot scale in-context learning to lifelong sequences because softmax attention stores an ever-growing key-value cache that eventually exceeds any fixed hardware budget. It claims parametric forms of attention solve this by using online parametric regression to learn key-value relationships at test time, replacing the cache with a neural network whose memory footprint stays constant. This framing generalizes parametric approaches such as linear attention, state-space models, fast weight programmers, and test-time training layers, and contrasts them with nonparametric counterparts such as softmax attention. A sympathetic reader would care because the approach points to a way for AI agents to accumulate and use experience over arbitrary time horizons without periodic full retraining. The paper identifies current shortfalls of these methods in memory capacity and update cost, and poses open questions to guide progress.

Core claim

Parametric forms of attention learn the relationship between keys and their associated values at test-time with parametric regression, replacing the ever-growing key-value cache with an online-trainable neural network to maintain a constant memory footprint while extending in-context learning to lifelong settings.

What carries the argument

Parametric attention, which performs parametric regression to associate keys with values during inference instead of storing them explicitly.
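
To make the idea concrete, here is a minimal sketch of one member of this family, assuming a single linear memory matrix updated with the delta rule (as in the fast-weight and DeltaNet-style work listed in the references). It is an illustration, not the paper's method: the function name `delta_rule_attention`, the step size `lr`, and the use of NumPy are all assumptions made for this example.

```python
import numpy as np

def delta_rule_attention(keys, values, queries, lr=1.0):
    """Process a sequence step by step with a fixed-size memory matrix W.

    Hypothetical helper for illustration: W plays the role of the
    online-trained estimator that replaces the key-value cache.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))  # memory size does not depend on sequence length
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # One online gradient step on 0.5 * ||W k - v||^2 (the delta rule).
        W = W - lr * np.outer(W @ k - v, k)
        # The attention output is the estimator's prediction for the query.
        outputs.append(W @ q)
    return np.stack(outputs), W

# Usage: orthonormal keys can be stored and recalled without interference.
keys = np.eye(4)
values = np.arange(12.0).reshape(4, 3)
outputs, W = delta_rule_attention(keys, values, keys)
```

With unit-norm keys and `lr=1.0`, each update makes `W @ k` equal `v` for the current pair. Keys that are not orthogonal interfere with one another, which is one simple way to see the capacity limit the paper flags.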

If this is right

Transformers can process sequences of arbitrary length while using only constant memory.
In-context learning extends to lifelong continual learning for AI agents.
Methods such as linear attention and state-space models become building blocks for long-horizon agents.
Resolving capacity and update-cost limits in parametric attention will be required to realize the approach.
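
The constant-memory point can be checked with simple arithmetic. The sketch below counts stored floats for a single attention head, assuming the cache holds one key and one value vector per token and the parametric memory is a single `d_v × d_k` matrix. The function names and the dimensions used are illustrative assumptions, and layers, heads, batch size, and numeric precision are ignored.

```python
def kv_cache_floats(seq_len, d_k, d_v):
    """Floats held by one softmax-attention head after seq_len tokens."""
    return seq_len * (d_k + d_v)

def matrix_state_floats(d_k, d_v):
    """Floats held by one matrix-valued memory, independent of sequence length."""
    return d_k * d_v

def crossover_length(d_k, d_v):
    """Smallest sequence length at which the cache is at least as large as the state."""
    # Solve seq_len * (d_k + d_v) >= d_k * d_v using ceiling division.
    return -(-(d_k * d_v) // (d_k + d_v))

# Example with assumed head dimensions d_k = d_v = 64.
print(crossover_length(64, 64))          # 32
print(kv_cache_floats(1_000_000, 64, 64))  # 128000000
print(matrix_state_floats(64, 64))       # 4096
```

Under these assumptions the cache overtakes the fixed state after a few dozen tokens and keeps growing linearly, while the matrix state stays the same size at any horizon.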

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid architectures that combine parametric attention with existing transformer layers could enable gradual knowledge accumulation without full model resets.
Success would shift emphasis from ever-larger pretraining runs toward ongoing test-time adaptation.
The open questions on capacity and cost suggest concrete benchmarks that measure retention over thousands of steps with fixed parameters.

Load-bearing premise

An online-trainable neural network can replace the key-value cache without prohibitive update costs or hitting hard capacity limits.

What would settle it

A direct comparison showing that a parametric attention model either drops accuracy on sequences longer than its training horizon or requires more total compute than softmax attention with a memory-bounded cache.

Figures

Figures reproduced from arXiv: 2606.25342 by Luke McDermott, Rahul Parhi, Robert W. Heath Jr.

**Figure 1.** Attention as Test-Time Regression [42]. Across three time steps, we illustrate how attention generates self-supervised training pairs and sequentially fits an estimator m_t. The output of attention at any given time is the prediction of the query.

**Figure 2.** Nonparametric attention uses key-value pairs to form the estimator, leading to unbounded growth. Softmax attention can be viewed as an MLP with KV pairs as weights.
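
The contrast in this caption can be written out under one common formulation. This is a sketch: the symbols $k_i$, $v_i$, $q$, $d$, and $S_t$ are notation assumed here rather than taken from the paper, and the parametric example is the simplest unnormalized linear-attention update.

```latex
% Nonparametric: softmax attention as a kernel-weighted average over all stored pairs.
\[
m_t(q) = \sum_{i=1}^{t}
  \frac{\exp\!\left(q^\top k_i / \sqrt{d}\right)}
       {\sum_{j=1}^{t} \exp\!\left(q^\top k_j / \sqrt{d}\right)} \, v_i ,
\qquad \text{memory } O(t\,d).
\]

% Parametric: unnormalized linear attention as a fixed-size state updated online.
\[
S_t = S_{t-1} + v_t k_t^\top , \qquad m_t(q) = S_t \, q ,
\qquad \text{memory } O(d^2).
\]
```

The first estimator must keep every pair $(k_i, v_i)$ to evaluate the sum, so its storage grows with $t$. The second folds each pair into $S_t$ and can then discard it.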

Original abstract

Lifelong continual learning remains an obstacle on the path to human-like intelligence. Modern transformers show sparks of intelligence with in-context learning. The quadratic nature of attention, however, prohibits transformers from performing this process on arbitrarily long sequences. In this work, we argue that extending in-context learning to lifelong settings is a practical solution for continual learning in AI agents. In particular, we argue that parametric forms of attention are needed to understand a lifetime of context with transformers on a fixed hardware budget. These attention mechanisms learn the relationship between keys and their associated values at test time with parametric regression. Our generalization of parametric approaches (linear attention, state-space models, fast weight programmers, and test-time training layers) contrasts with nonparametric counterparts like softmax attention. They replace the ever-growing key-value cache with an online-trainable neural network, maintaining a constant memory footprint. We highlight how parametric attention currently falls short of lifelong learning due to limited memory capacity or costly online updates. To address these issues, we pose a set of open questions with novel insights to guide the field toward long-horizon agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a position paper that groups existing parametric attention ideas under a lifelong ICL framing and lists open questions, but adds no new results or evidence.

The core claim is that transformers need parametric attention (linear attention, SSMs, fast weights, test-time training) instead of softmax to do lifelong in-context learning on fixed hardware, because the KV cache grows without bound. The paper contrasts these methods as online regression against nonparametric storage and notes that current versions still hit capacity or update-cost walls, then ends with open questions.

What is actually here is a synthesis that applies the parametric/nonparametric distinction to the continual-learning setting. The authors correctly flag the memory scaling problem and point to the same families of work that have already been explored for efficiency. They do not introduce new equations, proofs, or experiments.

The main limitation is that the argument stays at the level of premise and restatement. It assumes constant memory is mandatory and that parametric regression will eventually suffice, but offers no quantitative comparison to replay buffers, regularization, or other continual-learning baselines. The open questions are sensible but not derived from new analysis; they largely follow from the acknowledged shortcomings of the cited methods.

A reader already familiar with linear attention and SSM literature will not find new technical content. Someone new to the area might get a compact map of the relevant papers and a reminder of the memory issue. The work is honest about its own gaps and does not overclaim results.

This belongs in a workshop or as a perspective note rather than a standard conference paper. It does not contain enough substance for a serious referee to evaluate as a research contribution.

Referee Report

1 major / 0 minor

Summary. The manuscript is a position paper arguing that extending in-context learning to lifelong settings in transformers requires parametric forms of attention. These replace the growing key-value cache of softmax attention with an online-trainable neural network via parametric regression, thereby maintaining constant memory on fixed hardware while processing arbitrarily long sequences. The paper unifies linear attention, state-space models, fast weight programmers, and test-time training under this parametric umbrella, contrasts them with nonparametric methods, acknowledges current shortcomings in capacity and update cost, and poses open questions to guide future research on long-horizon agents.

Significance. If the argument is accepted, the paper could usefully redirect research toward parametric attention mechanisms that support continual learning without unbounded memory growth. Its main contribution is the synthesis of existing methods under a single framing and the explicit listing of open challenges; the work contains no new derivations, proofs, or experiments.

major comments (1)

[Abstract] The necessity claim that parametric attention is required 'to understand a lifetime of context with transformers on a fixed hardware budget' assumes, without supporting argument, that constant memory is mandatory. The manuscript provides no analysis ruling out alternatives such as cache compression, hierarchical memory, or periodic eviction that could keep quadratic attention feasible within hardware limits.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the opportunity to strengthen the manuscript. As a position paper, our goal is to synthesize existing approaches and highlight open challenges rather than provide exhaustive proofs. We address the concern regarding the necessity claim below.

Referee: [Abstract] The necessity claim that parametric attention is required 'to understand a lifetime of context with transformers on a fixed hardware budget' assumes, without supporting argument, that constant memory is mandatory. The manuscript provides no analysis ruling out alternatives such as cache compression, hierarchical memory, or periodic eviction that could keep quadratic attention feasible within hardware limits.

Authors: We agree that the abstract states the necessity of parametric attention for constant-memory lifelong ICL without explicitly analyzing alternatives. Our core argument is that any approach relying on an ever-growing key-value cache (even with compression or eviction) will eventually exceed fixed hardware limits for truly arbitrary sequence lengths, as compression ratios are bounded and eviction risks losing critical lifelong context. Hierarchical memory introduces additional complexity and latency that may not resolve the fundamental scaling issue. We will revise the abstract and add a short paragraph in the introduction (or a dedicated subsection) acknowledging these alternatives and explaining why we view parametric regression as the more scalable path for constant footprint. This revision will be made without altering the position paper's scope or adding new experiments.

Revision: yes

Circularity Check

0 steps flagged

Position paper with no derivational claims or reductions

The paper is explicitly a position paper that argues for the necessity of parametric attention forms to enable lifelong in-context learning under fixed memory constraints. It presents no equations, formal derivations, fitted parameters, predictions, or empirical results. The abstract and text instead highlight shortcomings of existing parametric methods, contrast them with softmax attention at a conceptual level, and conclude by posing open questions. No load-bearing step reduces to a self-definition, fitted input, or self-citation chain; the argument is self-contained as advocacy for future work rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The argument rests on domain assumptions about memory scaling and the viability of online regression; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: The quadratic nature of attention prohibits transformers from performing in-context learning on arbitrarily long sequences. Invoked in the abstract as the core motivation for switching to parametric forms.

Domain assumption: An online-trainable neural network can replace the ever-growing key-value cache while maintaining performance. Central premise of the proposed solution category.

pith-pipeline@v0.9.1-grok · 5724 in / 1166 out tokens · 23857 ms · 2026-06-25T21:17:54.994058+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 2 canonical work pages

[1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

[2] Zeyuan Allen-Zhu. Physics of language models: Part 4.1, architecture design and the magic of canon layers, 2025. URL https://arxiv.org/abs/2512.17351

[3] Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré. Just read twice: closing the recall gap for recurrent language models, 2024. URL https://arxiv.org/abs/2407.05483

[4] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

[5] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[6] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024.

[7] Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735, 2025.

[8] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025.

[9] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures. arXiv preprint arXiv:2512.24695, 2025.

[10] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. arXiv preprint arXiv:2410.21265, 2024.
[11] Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024.

[12] Magdalena Biesialska, Katarzyna Biesialska, and Marta R Costa-jussà. Continual lifelong learning in natural language processing: A survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6523–6541, 2020.

[13] Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971.

[14] Rewon Child. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

[15] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning (ICML), 2024.

[16] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2024. URL https://arxiv.org/abs/2312.00752

[17] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.

[18] Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Designing robust transformers using robust kernel density estimation. Advances in Neural Information Processing Systems, 36:53362–53384, 2023.

[19] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.

[20] Geoffrey E Hinton and James A Anderson. Parallel models of associative memory. 1989.
[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

[22] Sara Hooker. The hardware lottery. Communications of the ACM, 64(12):58–65, 2021.

[23] J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. doi: 10.1073/pnas.79.8.2554. URL https://www.pnas.org/doi/abs/10.1073/pnas.79.8.2554

[24] Georgios Iatropoulos, Johanni Brea, and Wulfram Gerstner. Kernel memory networks: A unifying framework for memory modeling, 2024. URL https://arxiv.org/abs/2208.09416

[25] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

[26] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.

[27] Luke McDermott, Robert W Heath Jr, and Rahul Parhi. LoLA: Low-rank linear attention with sparse caching. arXiv preprint arXiv:2505.23666, 2025.

[28] Jorge Mendez-Mendez, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Embodied lifelong learning for task and motion planning. In Conference on Robot Learning, pages 2134–2150. PMLR, 2023.

[29] Elizbar A Nadaraya. Some new estimates for distribution functions. Theory of Probability & Its Applications, 9(3):497–500, 1964.

[30] Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, and Edoardo M Ponti. The sparse frontier: Sparse attention trade-offs in transformer LLMs. arXiv preprint arXiv:2504.17768, 2025.
[31] DL Prados and SC Kak. Neural network capacity using delta rule. Electronics Letters, 25(3):197–199, 1989.

[32] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.

[33] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021.

[34] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

[35] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. DeltaProduct: Increasing the expressivity of DeltaNet through products of Householders. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025.

[36] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[37] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.

[38] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models, 2023. URL https://arxiv.org/abs/2307.08621
[39] Richard S. Sutton, Michael Bowling, and Patrick M. Pilarski. The Alberta plan for AI research. URL https://arxiv.org/abs/2208.11173
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[42] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. MesaNet: Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233, 2025.

[43] Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025.

[44] Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

[45] Bernard Widrow. Adaptive filters. Aspects of network and system theory, 1971.

[46] Bernard Widrow and Marcian E Hoff. Adaptive switching circuits. In Neurocomputing: foundations of research, pages 123–134. 1988.

[47] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.

[48] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, pages 56501–56523. PMLR, 2024.

[49] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[50] Aston Zhang, Zachary C Lipton, Mu Li, and Alexander J Smola. Dive into deep learning. Cambridge University Press, 2023.

[51] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.

[52] Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025.

[53] Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory, 2025. URL https://arxiv.org/abs/2505.19488

[1] [1]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245, 2023

Pith/arXiv arXiv 2023

[2] [2]

Physics of language models: Part 4.1, architecture design and the magic of canon layers, 2025

Zeyuan Allen-Zhu. Physics of language models: Part 4.1, architecture design and the magic of canon layers, 2025. URLhttps://arxiv.org/abs/2512.17351

arXiv 2025

[3] [3]

Just read twice: closing the recall gap for recurrent language models, 2024

Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré. Just read twice: closing the recall gap for recurrent language models, 2024. URLhttps://arxiv.org/abs/2407.05483

arXiv 2024

[4] [4]

Self-supervised learning from images with a joint- embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023

2023

[5] [5]

xlstm: Ex- tended long short-term memory

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Ex- tended long short-term memory. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[6] [6]

Titans: Learning to memorize at test time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024

Pith/arXiv arXiv 2024

[7] [7]

Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

arXiv 2025

[8] [8]

It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025

arXiv 2025

[9] [9]

Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695, 2025

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695, 2025

arXiv 2025

[10] [10]

Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024

Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024

arXiv 2024

[11] [11]

Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

arXiv 2024

[12] [12]

Continual lifelong learning in natural language processing: A survey

Magdalena Biesialska, Katarzyna Biesialska, and Marta R Costa-jussà. Continual lifelong learning in natural language processing: A survey. InProceedings of the 28th International Conference on Computational Linguistics, pages 6523–6541, 2020

2020

[13] [13]

An iterative algorithm for computing the best estimate of an orthogonal matrix.SIAM Journal on Numerical Analysis, 8(2):358–364, 1971

Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix.SIAM Journal on Numerical Analysis, 8(2):358–364, 1971

1971

[14] [14]

Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

Rewon Child. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

Pith/arXiv arXiv 1904

[15] [15]

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning (ICML), 2024

2024

[16] [16]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2024. URL https://arxiv.org/abs/2312.00752

Pith/arXiv arXiv 2024

[17] [17]

Efficiently modeling long sequences with structured state spaces

Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.

2021

[18] [18]

Designing robust transformers using robust kernel density estimation. Advances in Neural Information Processing Systems, 36:53362–53384, 2023

Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Designing robust transformers using robust kernel density estimation. Advances in Neural Information Processing Systems, 36:53362–53384, 2023

2023

[19] [19]

The organization of behavior: A neuropsychological theory

Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology press, 2005

2005

[20] [20]

Parallel models of associative memory

Geoffrey E Hinton and James A Anderson. Parallel models of associative memory. 1989

1989

[21] [21]

Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

work page doi:10.1162/neco.1997.9.8.1735 1997

[22] [22]

The hardware lottery. Communications of the ACM, 64(12):58–65, 2021

Sara Hooker. The hardware lottery. Communications of the ACM, 64(12):58–65, 2021

2021

[23] [23]

Neural networks and physical systems with emergent collective computational abilities

J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. doi: 10.1073/pnas.79.8.2554. URL https://www.pnas.org/doi/abs/10.1073/pnas.79.8.2554

work page doi:10.1073/pnas.79.8.2554 1982

[24] [24]

Kernel memory networks: A unifying framework for memory modeling, 2024

Georgios Iatropoulos, Johanni Brea, and Wulfram Gerstner. Kernel memory networks: A unifying framework for memory modeling, 2024. URL https://arxiv.org/abs/2208.09416

2024

[25] [25]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

2024

[26] [26]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

2020

[27] [27]

Lola: Low-rank linear attention with sparse caching. arXiv preprint arXiv:2505.23666, 2025

Luke McDermott, Robert W Heath Jr, and Rahul Parhi. Lola: Low-rank linear attention with sparse caching. arXiv preprint arXiv:2505.23666, 2025

arXiv 2025

[28] [28]

Embodied lifelong learning for task and motion planning

Jorge Mendez-Mendez, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Embodied lifelong learning for task and motion planning. In Conference on Robot Learning, pages 2134–2150. PMLR, 2023

2023

[29] [29]

Some new estimates for distribution functions. Theory of Probability & Its Applications, 9(3):497–500, 1964

Elizbar A Nadaraya. Some new estimates for distribution functions. Theory of Probability & Its Applications, 9(3):497–500, 1964

1964

[30] [30]

The sparse frontier: Sparse attention trade-offs in transformer llms. arXiv preprint arXiv:2504.17768, 2025

Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, and Edoardo M Ponti. The sparse frontier: Sparse attention trade-offs in transformer llms. arXiv preprint arXiv:2504.17768, 2025

Pith/arXiv arXiv 2025

[31] [31]

Neural network capacity using delta rule. Electronics Letters, 25(3):197–199, 1989

DL Prados and SC Kak. Neural network capacity using delta rule. Electronics Letters, 25(3):197–199, 1989

1989

[32] [32]

Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

2019

[33] [33]

Linear transformers are secretly fast weight programmers

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International conference on machine learning, pages 9355–9366. PMLR, 2021

2021

[34] [34]

Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

1992

[35] [35]

Deltaproduct: Increasing the expressivity of deltanet through products of householders

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Increasing the expressivity of deltanet through products of householders. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025

2025

[36] [36]

Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

2024

[37] [37]

Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024

Pith/arXiv arXiv 2024

[38] [38]

Retentive network: A successor to transformer for large language models, 2023

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models, 2023. URL https://arxiv.org/abs/2307.08621

Pith/arXiv arXiv 2023

[39] [39]

The alberta plan for ai research

Richard S. Sutton, Michael Bowling, and Patrick M. Pilarski. The alberta plan for ai research. URL https://arxiv.org/abs/2208.11173

arXiv

[41] [41]

Attention is all you need. Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

2017

[42] [42]

Mesanet: Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233, 2025

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet: Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233, 2025

Pith/arXiv arXiv 2025

[43] [43]

Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025

Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025

arXiv 2025

[44] [44]

Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

1964

[45] [45]

Adaptive filters. Aspects of network and system theory, 1971

Bernard Widrow. Adaptive filters. Aspects of network and system theory, 1971

1971

[46] [46]

Adaptive switching circuits

Bernard Widrow and Marcian E Hoff. Adaptive switching circuits. In Neurocomputing: foundations of research, pages 123–134. 1988

1988

[47] [47]

Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

Pith/arXiv arXiv 2024

[48] [48]

Gated linear attention transformers with hardware-efficient training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, pages 56501–56523. PMLR, 2024

2024

[49] [49]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[50] [50]

Dive into deep learning

Aston Zhang, Zachary C Lipton, Mu Li, and Alexander J Smola. Dive into deep learning. Cambridge University Press, 2023

2023

[51] [51]

Test-time training done right. arXiv preprint arXiv:2505.23884, 2025

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025

Pith/arXiv arXiv 2025

[52] [52]

Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025

Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025

arXiv 2025

[53] [53]

Understanding transformer from the perspective of associative memory, 2025

Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory, 2025. URL https://arxiv.org/abs/2505.19488.

2025