pith. machine review for the scientific record.

arxiv: 2604.06169 · v1 · submitted 2026-04-07 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Recognition: 2 Lean theorem links

In-Place Test-Time Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML
keywords test time training · large language models · MLP · fast weights · next token prediction · continual adaptation · inference time · context length

The pith

In-Place Test-Time Training endows large language models with the ability to adapt weights at inference time by updating the final projection matrices of their MLP blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are currently limited by a fixed set of weights after training, which prevents them from responding to new data streams during use. In-Place TTT overcomes this by selecting the final projection matrix in every MLP block as the fast weights that get updated at test time. The method introduces a next-token prediction objective that matches the core task of language modeling, along with chunk-wise updates that work with parallel processing of long contexts. This results in better performance for a 4 billion parameter model on inputs as long as 128 thousand tokens, and stronger results than other test-time training techniques when the model is trained from the start. A reader would care if they want models that keep learning after deployment without full retraining.
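
Concretely, the mechanism as described reduces to a small loop: freeze everything except the final MLP projection, score the next chunk of tokens with the model's own next-token loss, and take one gradient step on that single matrix before moving on. The sketch below is a minimal illustration under assumed choices (a single two-layer MLP block, plain SGD as the update rule, toy sizes); the module names, learning rate, and chunking interface are not the paper's.

# Minimal sketch, not the paper's code: adapt only the final MLP projection
# ("fast weights") with a next-token cross-entropy loss, one step per chunk.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, d_ff, chunk_len, fast_lr = 100, 32, 64, 16, 1e-2  # assumed toy values

embed = torch.nn.Embedding(vocab, d_model)
w_up = torch.nn.Linear(d_model, d_ff)      # slow weights, frozen at test time
w_down = torch.nn.Linear(d_ff, d_model)    # fast weights, updated in place
lm_head = torch.nn.Linear(d_model, vocab)  # slow weights, frozen at test time

def logits_fn(tokens):
    h = embed(tokens)
    h = h + w_down(F.gelu(w_up(h)))        # residual MLP block
    return lm_head(h)

stream = torch.randint(0, vocab, (1, 128))  # stand-in for a long context
for start in range(0, stream.size(1) - 1, chunk_len):
    seg = stream[:, start:start + chunk_len + 1]
    loss = F.cross_entropy(
        logits_fn(seg[:, :-1]).reshape(-1, vocab), seg[:, 1:].reshape(-1)
    )
    grad = torch.autograd.grad(loss, w_down.weight)[0]
    with torch.no_grad():                   # in-place update of the fast weights only
        w_down.weight -= fast_lr * grad

The real method additionally has to decide how the per-chunk update interacts with attention and with context parallelism, which this sketch deliberately ignores.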

Core claim

In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a drop-in enhancement for LLMs without costly retraining from scratch. It replaces TTT's generic reconstruction objective with a tailored objective aligned with next-token prediction. Combined with an efficient chunk-wise update mechanism, this yields a scalable algorithm. Experiments show superior performance on long-context tasks, and the method outperforms competitive TTT approaches when pretrained from scratch.
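
The provided text contains no equations, so the contrast can only be rendered in assumed notation: a generic TTT reconstruction loss on input/target pairs versus a next-token cross-entropy evaluated with the fast weights substituted into the otherwise frozen model.

\mathcal{L}_{\text{recon}}(W) = \lVert f_W(k_t) - v_t \rVert^2
\quad\text{vs.}\quad
\mathcal{L}_{\text{NTP}}(W) = -\log p_{\theta, W}\!\left(x_{t+1} \mid x_{\le t}\right)

Here f_W is the fast-weight map, (k_t, v_t) an input/target pair in the reconstruction view, and p_{\theta, W} the frozen model \theta with its final MLP projection replaced by W; all of these symbols are placeholders, not the paper's notation.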

What carries the argument

The final projection matrix of MLP blocks as fast weights, updated with a next-token-prediction objective through chunk-wise mechanisms.

Load-bearing premise

That adapting only the final projection matrices inside the MLP blocks using the new next-token objective produces stable updates that improve performance without degrading the model or needing other changes.

What would settle it

A direct comparison where a model with In-Place TTT fails to improve or worsens on long-context benchmarks relative to its non-adapting counterpart would falsify the central effectiveness claim.

read the original abstract

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces In-Place Test-Time Training (In-Place TTT) as a drop-in framework for LLMs that adapts only the final projection matrix within each MLP block as fast weights during inference. It replaces generic TTT reconstruction objectives with a new next-token-prediction-aligned objective and uses chunk-wise updates for scalability with context parallelism. Experiments claim that this enables a 4B model to outperform baselines on tasks with up to 128k contexts as an in-place enhancement, and that pretraining from scratch with In-Place TTT consistently beats competitive TTT methods, supported by ablations on design choices.

Significance. If the empirical results and stability claims hold under the restricted adaptation, this could meaningfully advance practical test-time adaptation for existing LLMs by avoiding architectural changes or full retraining. The emphasis on a theoretically aligned objective and compatibility with long contexts addresses real barriers in the TTT literature for language modeling. The drop-in property and reported outperformance on 128k contexts would be notable strengths if the limited fast-weight capacity proves sufficient without side effects.

major comments (2)
  1. [§3] §3 (Method) and Eq. for the new objective: the claim that the objective is 'theoretically-grounded' and independent of experimental outcomes is not demonstrated in the provided description; the derivation must be shown explicitly to confirm it does not reduce to a fitted quantity or introduce circularity with the reported gains.
  2. [Experiments] Experiments section (4B model results on 128k contexts): the central claim that restricting updates to only the final MLP projection matrix produces stable, effective adaptation without degrading the rest of the model or requiring changes rests on unverified assumptions about capacity; additional controls or analysis are needed to show why this restriction suffices rather than leaking or underfitting on long contexts.
minor comments (2)
  1. [Abstract] Abstract: notation for 'fast weights' and 'chunk size' should be defined on first use for clarity.
  2. [§3] The description of 'context parallelism' compatibility would benefit from a brief diagram or pseudocode in the methods to illustrate the chunk-wise mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and analyses.

read point-by-point responses
  1. Referee: [§3] §3 (Method) and Eq. for the new objective: the claim that the objective is 'theoretically-grounded' and independent of experimental outcomes is not demonstrated in the provided description; the derivation must be shown explicitly to confirm it does not reduce to a fitted quantity or introduce circularity with the reported gains.

    Authors: We appreciate this observation. The objective is obtained by replacing the generic reconstruction loss of prior TTT methods with the standard autoregressive cross-entropy loss applied to the next token, where the loss is evaluated after the in-place update of the fast weights. This construction follows directly from the next-token-prediction objective that defines language-model training and does not depend on any post-hoc fitting to the reported results. To make the grounding fully explicit and to rule out any appearance of circularity, we will insert the complete derivation (including the precise loss expression and the justification for its independence from experimental outcomes) into the revised Section 3. revision: yes

  2. Referee: [Experiments] Experiments section (4B model results on 128k contexts): the central claim that restricting updates to only the final MLP projection matrix produces stable, effective adaptation without degrading the rest of the model or requiring changes rests on unverified assumptions about capacity; additional controls or analysis are needed to show why this restriction suffices rather than leaking or underfitting on long contexts.

    Authors: We agree that stronger evidence for the sufficiency of the restricted adaptation is warranted. The final projection matrix is chosen because it is the linear transformation that produces the MLP block output after the non-linearity, thereby providing a compact yet expressive site for fast-weight updates while preserving the rest of the model unchanged. The 4B-model experiments already demonstrate stable gains up to 128k contexts without degradation on shorter contexts or unrelated tasks, which is consistent with adequate capacity. Nevertheless, we will add in the revised experiments section (i) an ablation comparing adaptation of the final projection versus other matrices inside the MLP block and (ii) a capacity analysis that tracks the effective rank and gradient norms of the updated weights across long contexts, thereby directly addressing concerns about leakage or underfitting. revision: yes
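
As one illustration of what the promised capacity analysis might compute, the entropy-based effective rank of the accumulated fast-weight update and its norm could be tracked per chunk; the definition and the code below are assumptions, not taken from the paper.

# Illustrative only: effective rank (exponential of the entropy of the
# normalized singular values) and norm of a fast-weight update delta_w.
import torch

def effective_rank(delta_w: torch.Tensor) -> float:
    s = torch.linalg.svdvals(delta_w)
    p = s / s.sum()
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return float(torch.exp(entropy))

delta_w = 1e-3 * (torch.randn(64, 8) @ torch.randn(8, 32))  # stand-in for W_fast - W_init
print(f"effective rank ≈ {effective_rank(delta_w):.1f}, update norm = {delta_w.norm().item():.4f}")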

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The abstract presents the In-Place TTT framework as a practical design choice: using the final projection matrix of MLP blocks as fast weights for drop-in compatibility, and replacing the generic reconstruction objective with a next-token-prediction-aligned objective described as theoretically grounded. No equations are shown in the provided text, and no self-citations are invoked to justify the core choices. The experimental results on the 4B model and the pretraining comparisons are presented as validation, not as the basis for the design. Therefore, there is no reduction of predictions to inputs by construction, and the derivation chain appears self-contained, checked against external benchmarks such as standard TTT methods.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework relies on the assumption that MLP final projections can serve as effective fast weights and that the new objective aligns with autoregressive modeling without introducing instability. No explicit free parameters are named in the abstract, but chunk size and update learning rate are implied implementation choices.

free parameters (2)
  • update learning rate
    Likely tuned for the test-time adaptation step, though not quantified in the abstract.
  • chunk size
    Determines the granularity of the efficient update mechanism for long contexts.
axioms (2)
  • domain assumption The final projection matrix in MLP blocks can be updated independently without affecting model stability or requiring changes to other components.
    Invoked to justify the drop-in nature of the method.
  • domain assumption A next-token-prediction-aligned objective is superior to generic reconstruction for test-time adaptation in autoregressive LLMs.
    Central to replacing the standard TTT objective.
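
Made concrete, the two implied free parameters would likely surface as configuration knobs along these lines; the names and defaults are hypothetical, since the abstract quantifies neither.

# Hypothetical configuration; values are placeholders, not taken from the paper.
from dataclasses import dataclass

@dataclass
class InPlaceTTTConfig:
    chunk_size: int = 2048   # granularity of the chunk-wise fast-weight updates
    fast_lr: float = 1e-3    # test-time learning rate for the final MLP projection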

pith-pipeline@v0.9.0 · 5594 in / 1418 out tokens · 36745 ms · 2026-05-10T19:00:51.067928+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  2. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

Reference graph

Works this paper leans on

67 extracted references · 46 canonical work pages · cited by 1 Pith paper · 27 internal anchors
