arxiv: 2606.25342 · v1 · pith:44T3NNESnew · submitted 2026-06-24 · 💻 cs.LG

Lifelong In-Context Learning with Transformers Requires Parametric Forms of Attention

Pith reviewed 2026-06-25 21:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords: lifelong in-context learning, parametric attention, transformers, continual learning, key-value cache, online regression, state-space models

The pith

Transformers require parametric attention to perform lifelong in-context learning on a fixed memory budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard transformers cannot scale in-context learning to lifelong sequences because softmax attention stores an ever-growing key-value cache that exceeds fixed hardware limits. It claims parametric forms of attention solve this by using online parametric regression to learn key-value relationships at test time, replacing the cache with a neural network that keeps memory constant. This framing generalizes parametric approaches such as linear attention and state-space models, and contrasts them with nonparametric alternatives such as softmax attention. A sympathetic reader would care because the approach points to a way for AI agents to accumulate and use experience over arbitrary time horizons without periodic full retraining. The work identifies current shortfalls in capacity and update cost for these methods and poses open questions to guide progress.
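To make the fixed-budget argument concrete, the following back-of-envelope sketch estimates how a softmax key-value cache grows with context length. The model dimensions (32 layers, 8 key-value heads, head dimension 128, 2-byte elements) and the 80 GiB budget are hypothetical values chosen for illustration; none of them come from the paper.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache one key and one value vector per token, head, and layer."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_tokens_in_budget(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Largest context length whose cache still fits in the memory budget."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return budget_bytes // per_token


# Hypothetical configuration, for illustration only.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
BUDGET = 80 * 2**30  # 80 GiB

per_token = kv_cache_bytes(1, **CONFIG)
limit = max_tokens_in_budget(BUDGET, **CONFIG)
print(f"cache per token: {per_token} bytes")
print(f"tokens that fit in the budget: {limit}")
```

Under these assumptions the cache grows linearly in the number of tokens, so any fixed budget is exhausted after a finite context length. A bounded compression ratio only rescales that limit by a constant factor, whereas a constant-size state does not depend on `n_tokens` at all.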

Core claim

Parametric forms of attention learn the relationship between keys and their associated values at test-time with parametric regression, replacing the ever-growing key-value cache with an online-trainable neural network to maintain a constant memory footprint while extending in-context learning to lifelong settings.

What carries the argument

Parametric attention, which performs parametric regression to associate keys with values during inference instead of storing them explicitly.
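As a minimal sketch of what "regression instead of storage" can look like, the example below uses a linear memory updated with one gradient step per key-value pair on a squared-error loss (the delta rule). The function names, the learning rate, and the toy data are assumptions made for illustration; the paper covers a broader family of parametric estimators and this is only one simple instance.

```python
def new_memory(d_k, d_v):
    """Fixed-size parameter matrix; its shape never depends on sequence length."""
    return [[0.0] * d_v for _ in range(d_k)]


def read(memory, query):
    """Predict a value for the query: the product memory^T query."""
    d_v = len(memory[0])
    return [sum(query[i] * memory[i][j] for i in range(len(query))) for j in range(d_v)]


def write(memory, key, value, lr=1.0):
    """One gradient step on 0.5 * ||value - memory^T key||^2 (delta rule), in place."""
    prediction = read(memory, key)
    error = [v - p for v, p in zip(value, prediction)]
    for i, k_i in enumerate(key):
        for j, e_j in enumerate(error):
            memory[i][j] += lr * k_i * e_j


# Toy stream of (key, value) pairs with orthonormal keys (illustrative data).
memory = new_memory(d_k=2, d_v=2)
stream = [([1.0, 0.0], [2.0, -1.0]), ([0.0, 1.0], [0.5, 3.0])]
for key, value in stream:
    write(memory, key, value)

recalled = read(memory, [1.0, 0.0])
n_params = sum(len(row) for row in memory)
print(recalled, n_params)
```

The memory holds `d_k * d_v` numbers no matter how many pairs are written, which is the constant-footprint property the argument relies on. Exact recall in this toy case depends on the keys being orthonormal.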

If this is right

  • Transformers can process sequences of arbitrary length while using only constant memory.
  • In-context learning extends to lifelong continual learning for AI agents.
  • Methods such as linear attention and state-space models become building blocks for long-horizon agents.
  • Resolving capacity and update-cost limits in parametric attention will be required to realize the approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid architectures that combine parametric attention with existing transformer layers could enable gradual knowledge accumulation without full model resets.
  • Success would shift emphasis from ever-larger pretraining runs toward ongoing test-time adaptation.
  • The open questions on capacity and cost suggest concrete benchmarks that measure retention over thousands of steps with fixed parameters.

Load-bearing premise

An online-trainable neural network can replace the key-value cache without prohibitive update costs or hitting hard capacity limits.
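The capacity side of this premise can be illustrated with a toy linear memory. In the sketch below, a two-dimensional memory stores two orthogonal keys exactly, and writing a third, non-orthogonal key then corrupts recall of the first. The scalar values, keys, and learning rates are assumptions picked so the arithmetic is easy to follow; they are not taken from the paper.

```python
def read(memory, query):
    """Scalar prediction: dot product of the memory vector and the query."""
    return sum(q * m for q, m in zip(query, memory))


def write(memory, key, value, lr):
    """Delta-rule step on a squared-error loss; returns the updated memory."""
    error = value - read(memory, key)
    return [m + lr * k * error for m, k in zip(memory, key)]


memory = [0.0, 0.0]                              # capacity: two dimensions
memory = write(memory, [1.0, 0.0], 1.0, lr=1.0)
memory = write(memory, [0.0, 1.0], 2.0, lr=1.0)
before = read(memory, [1.0, 0.0])                # first pair recalled exactly

# Third key overlaps both stored keys; lr = 1 / ||key||^2 fits it exactly.
memory = write(memory, [1.0, 1.0], 10.0, lr=0.5)
after = read(memory, [1.0, 0.0])                 # first pair is now distorted

print(before, after, read(memory, [1.0, 1.0]))
```

Fitting the new pair overwrites part of what was stored for the old ones, which is the kind of hard capacity limit the premise has to survive.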

What would settle it

A direct comparison showing that a parametric attention model either drops accuracy on sequences longer than its training horizon or requires more total compute than softmax attention with a memory-bounded cache.
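A rough cost model shows how the compute half of such a comparison might be set up. The sketch counts multiply-accumulate operations for a single head and layer, assumes key and value dimensions are both `d`, and charges the parametric memory one read plus one delta-rule write per token. These accounting choices and the example sizes are assumptions for illustration, not measurements from the paper.

```python
def softmax_macs(n_tokens, d, window=None):
    """Cumulative MACs for softmax attention over a full or windowed cache."""
    total = 0
    for t in range(1, n_tokens + 1):
        n_cached = t if window is None else min(t, window)
        total += 2 * n_cached * d  # d per score, plus d per weighted value
    return total


def linear_memory_macs(n_tokens, d):
    """Cumulative MACs for a d x d linear memory with a delta-rule update."""
    per_token = 3 * d * d  # read (d*d) + write: predict (d*d) and update (d*d)
    return n_tokens * per_token


d, n_tokens = 64, 1000  # hypothetical sizes
full = softmax_macs(n_tokens, d)
windowed = softmax_macs(n_tokens, d, window=96)
parametric = linear_memory_macs(n_tokens, d)
print(full, windowed, parametric)
```

In this toy setting the parametric memory is cheaper than an unbounded cache but slightly more expensive than a cache capped at 96 entries, so total compute alone does not decide the question; accuracy at long horizons has to be measured alongside it.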

Figures

Figures reproduced from arXiv: 2606.25342 by Luke McDermott, Rahul Parhi, Robert W. Heath Jr.

Figure 1: Attention as Test-Time Regression [42]. Across three time steps, we illustrate how attention generates self-supervised training pairs and sequentially fits an estimator m_t. The output of attention at any given time is the prediction of the query.
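The caption's description of attention as sequentially fitting an estimator can be written out as follows. This is a sketch in assumed notation (loss \ell, hypothesis class \mathcal{M}, step size \eta, parameters \theta), not a reproduction of the paper's equations.

```latex
% Estimator fitted on the key-value pairs seen so far, then queried.
\[
  m_t = \arg\min_{m \in \mathcal{M}} \sum_{i=1}^{t} \ell\bigl(m(k_i),\, v_i\bigr),
  \qquad y_t = m_t(q_t).
\]
% Nonparametric case (softmax attention): the estimator is built from all stored pairs.
\[
  m_t(q) = \sum_{i=1}^{t} \frac{\exp(q^\top k_i)}{\sum_{j=1}^{t} \exp(q^\top k_j)}\, v_i .
\]
% Parametric case: a fixed-size parameter vector is updated online, one pair at a time.
\[
  \theta_t = \theta_{t-1} - \eta\, \nabla_\theta\, \ell\bigl(m_\theta(k_t),\, v_t\bigr)\Big|_{\theta = \theta_{t-1}},
  \qquad m_t = m_{\theta_t}.
\]
```

In the nonparametric case the estimator needs every pair (k_i, v_i), so its storage grows with t; in the parametric case only \theta_t is kept.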
Figure 2: Nonparametric attention uses key-value pairs to form the estimator, leading to unbounded growth. Softmax attention can be viewed as an MLP with KV pairs as weights.
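The nonparametric case in Figure 2 can be sketched in a few lines: the estimator is a softmax-weighted average over every stored pair, so the store has to grow with the sequence. The `KVCache` class, the unscaled dot-product scores, and the toy data are assumptions for illustration.

```python
import math


def softmax_attention(query, keys, values):
    """Softmax-weighted average of all stored values (unscaled dot-product scores)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d_v)]


class KVCache:
    """Hypothetical cache that keeps every key-value pair it is given."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def n_floats(self):
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


cache = KVCache()
sizes = []
for key, value in [([1.0, 0.0], [0.0]), ([1.0, 0.0], [2.0]), ([0.0, 1.0], [5.0])]:
    cache.append(key, value)
    sizes.append(cache.n_floats())

# Query against the first two pairs: identical keys give equal weights.
output = softmax_attention([1.0, 0.0], cache.keys[:2], cache.values[:2])
print(sizes, output)
```

The stored pairs play the role of the estimator's weights, which is why the memory footprint tracks the number of tokens seen.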
Original abstract

Lifelong continual learning remains an obstacle on the path to human-like intelligence. Modern transformers show sparks of intelligence with in-context learning. The quadratic nature of attention, however, prohibits transformers from performing this process on arbitrarily long sequences. In this work, we argue that extending in-context learning to lifelong settings is a practical solution for continual learning in AI agents. In particular, we argue that parametric forms of attention are needed to understand a lifetime of context with transformers on a fixed hardware budget. These attention mechanisms learn the relationship between keys and their associated values at test-time with parametric regression. Our generalization of parametric approaches (linear attention, state-space models, fast weight programmers, and test-time training layers) contrasts with nonparametric counterparts like softmax attention. They replace the ever-growing key-value cache with an online-trainable neural network, maintaining a constant memory footprint. We highlight how parametric attention currently falls short of lifelong learning due to limited memory capacity or costly online updates. To address these issues, we pose a set of open questions with novel insights to guide the field toward long-horizon agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript is a position paper arguing that extending in-context learning to lifelong settings in transformers requires parametric forms of attention. These replace the growing key-value cache of softmax attention with an online-trainable neural network via parametric regression, thereby maintaining constant memory on fixed hardware while processing arbitrarily long sequences. The paper unifies linear attention, state-space models, fast weight programmers, and test-time training under this parametric umbrella, contrasts them with nonparametric methods, acknowledges current shortcomings in capacity and update cost, and poses open questions to guide future research on long-horizon agents.

Significance. If the argument is accepted, the paper could usefully redirect research toward parametric attention mechanisms that support continual learning without unbounded memory growth. Its main contribution is the synthesis of existing methods under a single framing and the explicit listing of open challenges; the work contains no new derivations, proofs, or experiments.

major comments (1)
  1. [Abstract] The necessity claim that parametric attention is required 'to understand a lifetime of context with transformers on a fixed hardware budget' assumes, without supporting argument, that constant memory is mandatory. The manuscript provides no analysis ruling out alternatives such as cache compression, hierarchical memory, or periodic eviction that could keep quadratic attention feasible within hardware limits.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the opportunity to strengthen the manuscript. As a position paper, our goal is to synthesize existing approaches and highlight open challenges rather than provide exhaustive proofs. We address the concern regarding the necessity claim below.

Point-by-point responses
  1. Referee: [Abstract] The necessity claim that parametric attention is required 'to understand a lifetime of context with transformers on a fixed hardware budget' assumes, without supporting argument, that constant memory is mandatory. The manuscript provides no analysis ruling out alternatives such as cache compression, hierarchical memory, or periodic eviction that could keep quadratic attention feasible within hardware limits.

    Authors: We agree that the abstract states the necessity of parametric attention for constant-memory lifelong ICL without explicitly analyzing alternatives. Our core argument is that any approach relying on an ever-growing key-value cache (even with compression or eviction) will eventually exceed fixed hardware limits for truly arbitrary sequence lengths, as compression ratios are bounded and eviction risks losing critical lifelong context. Hierarchical memory introduces additional complexity and latency that may not resolve the fundamental scaling issue. We will revise the abstract and add a short paragraph in the introduction (or a dedicated subsection) acknowledging these alternatives and explaining why we view parametric regression as the more scalable path for constant footprint. This revision will be made without altering the position paper's scope or adding new experiments.

    Revision: yes

Circularity Check

0 steps flagged

Position paper with no derivational claims or reductions

full rationale

The paper is explicitly a position paper that argues for the necessity of parametric attention forms to enable lifelong in-context learning under fixed memory constraints. It presents no equations, formal derivations, fitted parameters, predictions, or empirical results. The abstract and text instead highlight shortcomings of existing parametric methods, contrast them with softmax attention at a conceptual level, and conclude by posing open questions. No load-bearing step reduces to a self-definition, fitted input, or self-citation chain; the argument is self-contained as advocacy for future work rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The argument rests on domain assumptions about memory scaling and the viability of online regression; no free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: The quadratic nature of attention prohibits transformers from performing in-context learning on arbitrarily long sequences.
    Invoked in the abstract as the core motivation for switching to parametric forms.
  • Domain assumption: An online-trainable neural network can replace the ever-growing key-value cache while maintaining performance.
    Central premise of the proposed solution category.

pith-pipeline@v0.9.1-grok · 5724 in / 1166 out tokens · 23857 ms · 2026-06-25T21:17:54.994058+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 2 canonical work pages

  [1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  [2] Zeyuan Allen-Zhu. Physics of language models: Part 4.1, architecture design and the magic of canon layers, 2025. URL https://arxiv.org/abs/2512.17351

  [3] Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré. Just read twice: closing the recall gap for recurrent language models, 2024. URL https://arxiv.org/abs/2407.05483

  [4] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023

  [5] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  [6] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024

  [7] Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735, 2025

  [8] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025

  [9] Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures. arXiv preprint arXiv:2512.24695, 2025

  [10] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. arXiv preprint arXiv:2410.21265, 2024

  [11] Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024

  [12] Magdalena Biesialska, Katarzyna Biesialska, and Marta R Costa-jussà. Continual lifelong learning in natural language processing: A survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6523–6541, 2020

  [13] Åke Björck and Clazett Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971

  [14] Rewon Child. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  [15] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning (ICML), 2024

  [16] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2024. URL https://arxiv.org/abs/2312.00752

  [17] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021

  [18] Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Designing robust transformers using robust kernel density estimation. Advances in Neural Information Processing Systems, 36:53362–53384, 2023

  [19] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology press, 2005

  [20] Geoffrey E Hinton and James A Anderson. Parallel models of associative memory. 1989

  [21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

  [22] Sara Hooker. The hardware lottery. Communications of the ACM, 64(12):58–65, 2021

  [23] J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. doi: 10.1073/pnas.79.8.2554. URL https://www.pnas.org/doi/abs/10.1073/pnas.79.8.2554

  [24] Georgios Iatropoulos, Johanni Brea, and Wulfram Gerstner. Kernel memory networks: A unifying framework for memory modeling, 2024. URL https://arxiv.org/abs/2208.09416

  [25] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

  [26] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  [27] Luke McDermott, Robert W Heath Jr, and Rahul Parhi. Lola: Low-rank linear attention with sparse caching. arXiv preprint arXiv:2505.23666, 2025

  [28] Jorge Mendez-Mendez, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Embodied lifelong learning for task and motion planning. In Conference on Robot Learning, pages 2134–2150. PMLR, 2023

  [29] Elizbar A Nadaraya. Some new estimates for distribution functions. Theory of Probability & Its Applications, 9(3):497–500, 1964

  [30] Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, and Edoardo M Ponti. The sparse frontier: Sparse attention trade-offs in transformer llms. arXiv preprint arXiv:2504.17768, 2025

  [31] DL Prados and SC Kak. Neural network capacity using delta rule. Electronics Letters, 25(3):197–199, 1989

  [32] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in neural information processing systems, 32, 2019

  [33] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International conference on machine learning, pages 9355–9366. PMLR, 2021

  [34] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

  [35] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Increasing the expressivity of deltanet through products of householders. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025

  [36] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  [37] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024

  [38] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models, 2023. URL https://arxiv.org/abs/2307.08621

  [39] Richard S. Sutton, Michael Bowling, and Patrick M. Pilarski. The alberta plan for ai research. URL https://arxiv.org/abs/2208.11173

  [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  [41] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al. Mesanet: Sequence modeling by locally optimal test-time training. arXiv preprint arXiv:2506.05233, 2025

  [42] Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv preprint arXiv:2501.12352, 2025

  [43] Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964

  [44] Bernard Widrow. Adaptive filters. Aspects of network and system theory, 1971

  [45] Bernard Widrow and Marcian E Hoff. Adaptive switching circuits. In Neurocomputing: foundations of research, pages 123–134. 1988

  [46] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

  [47] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning, pages 56501–56523. PMLR, 2024

  [48] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  [49] Aston Zhang, Zachary C Lipton, Mu Li, and Alexander J Smola. Dive into deep learning. Cambridge University Press, 2023

  [50] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025

  [51] Wenlin Zhang, Xiaopeng Li, Yingyi Zhang, Pengyue Jia, Yichao Wang, Huifeng Guo, Yong Liu, and Xiangyu Zhao. Deep research: A survey of autonomous research agents. arXiv preprint arXiv:2508.12752, 2025

  [52] Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory, 2025. URL https://arxiv.org/abs/2505.19488