arXiv: 2604.06169 · v1 · submitted 2026-04-07 · cs.LG · cs.AI · cs.CL · stat.ML

In-Place Test-Time Training

Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai

Pith reviewed 2026-05-10 19:00 UTC · model grok-4.3

keywords: test time training · large language models · MLP · fast weights · next token prediction · continual adaptation · inference time · context length

The pith

In-Place Test-Time Training endows large language models with the ability to adapt weights at inference time by updating the final projection matrices of their MLP blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are currently limited by a fixed set of weights after training, which prevents them from adapting to new data streams during use. In-Place TTT addresses this by selecting the final projection matrix in every MLP block as the fast weights that are updated at test time. The method introduces an objective aligned with next-token prediction, the core task of language modeling, along with chunk-wise updates that remain compatible with context parallelism over long contexts. The paper reports better performance for a 4B-parameter model on inputs as long as 128k tokens, and stronger results than other test-time training techniques when the model is pretrained from scratch. This matters to readers who want models that keep learning after deployment without full retraining.

Core claim

In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a drop-in enhancement for LLMs without costly retraining from scratch. It replaces TTT's generic reconstruction objective with a tailored objective aligned with next-token prediction. Combined with an efficient chunk-wise update mechanism, this produces a scalable algorithm. Experiments show superior performance on long-context tasks and outperformance of competitive approaches when pretrained from scratch.

What carries the argument

The final projection matrix of MLP blocks as fast weights, updated with a next-token-prediction objective through chunk-wise mechanisms.
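
As a reading aid, the mechanism named above can be written schematically. All notation below ($W_{\text{up}}$, $W_{\text{down}}$, $\sigma$, $\eta$, $\mathcal{C}_c$, $\ell_t$) is assumed here for illustration and is not taken from the paper. The block is shown ungated for brevity, a plain gradient step stands in for whatever update rule the paper actually uses, and the exact form of the next-token-aligned loss $\ell_t$ is not given on this page.

```latex
% Schematic only: notation and the plain gradient step are assumptions.
% MLP block whose final (down) projection plays the role of fast weights.
% Every token t in chunk C_c is projected with the same weights W_down^{(c)}.
\[
  h_t = \sigma\!\left(W_{\text{up}}\, x_t\right), \qquad
  y_t = W_{\text{down}}^{(c)}\, h_t \quad \text{for } t \in \mathcal{C}_c .
\]
% One fast-weight update per chunk, using a per-token loss \ell_t
% aligned with next-token prediction (exact form not shown on this page).
\[
  W_{\text{down}}^{(c+1)}
  = W_{\text{down}}^{(c)}
  - \eta \sum_{t \in \mathcal{C}_c}
    \nabla_{W_{\text{down}}}\, \ell_t\!\left(W_{\text{down}}^{(c)}\right).
\]
```

Written this way, the only state carried from one chunk to the next is the projection matrix itself, which is what makes the scheme a candidate for a drop-in change to an existing model.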

Load-bearing premise

That adapting only the final projection matrices inside the MLP blocks using the new next-token objective produces stable updates that improve performance without degrading the model or needing other changes.

What would settle it

A direct comparison where a model with In-Place TTT fails to improve or worsens on long-context benchmarks relative to its non-adapting counterpart would falsify the central effectiveness claim.

Original abstract

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

In-Place TTT makes test-time adaptation a drop-in by updating only the final MLP projection per block with a next-token objective, delivering reported gains on 128k contexts for a 4B model while raising capacity questions for the restricted updates.

The main thing to know is that this paper turns test-time training into something you can bolt onto existing LLMs. They designate the final projection matrix inside each MLP block as the fast weights and update them at inference with a loss built directly around next-token prediction rather than a generic reconstruction target. Chunk-wise processing keeps it compatible with context parallelism, so the whole thing scales without major rewrites to the model or training pipeline.
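
The chunk-wise idea in the letter can be made concrete with a toy sketch. This is not the paper's algorithm: the function name `chunkwise_fast_weight_pass`, the squared-error loss toward a hypothetical per-token target vector, and the plain gradient step are all stand-ins chosen so the example runs with the standard library alone. What it does illustrate is the control flow: every token in a chunk is projected with the same fast weights, and the weights change only once the chunk is finished.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, h: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector h (cols)."""
    return [sum(wij * hj for wij, hj in zip(row, h)) for row in w]


def chunkwise_fast_weight_pass(
    w_down: Matrix,
    hidden: List[Vector],
    targets: List[Vector],
    chunk_size: int,
    lr: float,
) -> Tuple[List[Vector], Matrix]:
    """Toy chunk-wise fast-weight update (illustrative assumption only).

    hidden[t]  : post-activation MLP features for token t.
    targets[t] : hypothetical target vector standing in for a
                 next-token-aligned training signal.
    Loss per chunk: 0.5 * sum_t ||W h_t - target_t||^2 (a stand-in loss).
    """
    w = [row[:] for row in w_down]  # copy so the caller's weights are untouched
    outputs: List[Vector] = []
    for start in range(0, len(hidden), chunk_size):
        chunk_h = hidden[start:start + chunk_size]
        chunk_y = targets[start:start + chunk_size]

        # Forward: all tokens in the chunk see the same (pre-update) weights.
        chunk_out = [matvec(w, h) for h in chunk_h]
        outputs.extend(chunk_out)

        # Gradient of the stand-in loss: sum_t (W h_t - target_t) h_t^T.
        grad = [[0.0] * len(w[0]) for _ in w]
        for h, y, out in zip(chunk_h, chunk_y, chunk_out):
            for i in range(len(w)):
                err = out[i] - y[i]
                for j in range(len(h)):
                    grad[i][j] += err * h[j]

        # One gradient step per chunk; later chunks use the updated weights.
        for i in range(len(w)):
            for j in range(len(w[0])):
                w[i][j] -= lr * grad[i][j]
    return outputs, w
```

Because the tokens inside a chunk do not depend on each other through the fast weights, their forward computations can run in parallel; only the chunk boundary is sequential.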

Referee Report

2 major / 2 minor

Summary. The paper introduces In-Place Test-Time Training (In-Place TTT) as a drop-in framework for LLMs that adapts only the final projection matrix within each MLP block as fast weights during inference. It replaces generic TTT reconstruction objectives with a new next-token-prediction-aligned objective and uses chunk-wise updates for scalability with context parallelism. Experiments claim that this enables a 4B model to outperform baselines on tasks with up to 128k contexts as an in-place enhancement, and that pretraining from scratch with In-Place TTT consistently beats competitive TTT methods, supported by ablations on design choices.

Significance. If the empirical results and stability claims hold under the restricted adaptation, this could meaningfully advance practical test-time adaptation for existing LLMs by avoiding architectural changes or full retraining. The emphasis on a theoretically aligned objective and compatibility with long contexts addresses real barriers in the TTT literature for language modeling. The drop-in property and reported outperformance on 128k contexts would be notable strengths if the limited fast-weight capacity proves sufficient without side effects.

major comments (2)

[§3] §3 (Method) and Eq. for the new objective: the claim that the objective is 'theoretically-grounded' and independent of experimental outcomes is not demonstrated in the provided description; the derivation must be shown explicitly to confirm it does not reduce to a fitted quantity or introduce circularity with the reported gains.
[Experiments] Experiments section (4B model results on 128k contexts): the central claim that restricting updates to only the final MLP projection matrix produces stable, effective adaptation without degrading the rest of the model or requiring changes rests on unverified assumptions about capacity; additional controls or analysis are needed to show why this restriction suffices rather than leaking or underfitting on long contexts.

minor comments (2)

[Abstract] Abstract: notation for 'fast weights' and 'chunk size' should be defined on first use for clarity.
[§3] The description of 'context parallelism' compatibility would benefit from a brief diagram or pseudocode in the methods to illustrate the chunk-wise mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and analyses.

Referee: [§3] §3 (Method) and Eq. for the new objective: the claim that the objective is 'theoretically-grounded' and independent of experimental outcomes is not demonstrated in the provided description; the derivation must be shown explicitly to confirm it does not reduce to a fitted quantity or introduce circularity with the reported gains.

Authors: We appreciate this observation. The objective is obtained by replacing the generic reconstruction loss of prior TTT methods with the standard autoregressive cross-entropy loss applied to the next token, where the loss is evaluated after the in-place update of the fast weights. This construction follows directly from the next-token-prediction objective that defines language-model training and does not depend on any post-hoc fitting to the reported results. To make the grounding fully explicit and to rule out any appearance of circularity, we will insert the complete derivation (including the precise loss expression and the justification for its independence from experimental outcomes) into the revised Section 3.
Revision: yes

Referee: [Experiments] Experiments section (4B model results on 128k contexts): the central claim that restricting updates to only the final MLP projection matrix produces stable, effective adaptation without degrading the rest of the model or requiring changes rests on unverified assumptions about capacity; additional controls or analysis are needed to show why this restriction suffices rather than leaking or underfitting on long contexts.

Authors: We agree that stronger evidence for the sufficiency of the restricted adaptation is warranted. The final projection matrix is chosen because it is the linear transformation that produces the MLP block output after the non-linearity, thereby providing a compact yet expressive site for fast-weight updates while preserving the rest of the model unchanged. The 4B-model experiments already demonstrate stable gains up to 128k contexts without degradation on shorter contexts or unrelated tasks, which is consistent with adequate capacity. Nevertheless, we will add in the revised experiments section (i) an ablation comparing adaptation of the final projection versus other matrices inside the MLP block and (ii) a capacity analysis that tracks the effective rank and gradient norms of the updated weights across long contexts, thereby directly addressing concerns about leakage or underfitting.
Revision: yes
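
A capacity analysis of the kind proposed in the second response could start from simple matrix diagnostics. The sketch below is an assumption about what such tracking might look like, not something described in the paper. It computes the Frobenius norm of a gradient or weight delta, and it uses the stable rank (squared Frobenius norm divided by squared spectral norm) as a cheap proxy. Stable rank is a different quantity from the effective rank named in the response and is swapped in here only because it needs no SVD routine. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(a: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def spectral_norm(a: Matrix, iters: int = 200) -> float:
    """Largest singular value, estimated by power iteration on A^T A.

    The start vector (1, 1/2, 1/3, ...) is an arbitrary choice; power
    iteration can fail if it is orthogonal to the top singular vector.
    """
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / (j + 1) for j in range(n_cols)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        av = [sum(x * y for x, y in zip(row, v)) for row in a]  # A v
        atav = [sum(a[i][j] * av[i] for i in range(n_rows))     # A^T (A v)
                for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in atav]
    av = [sum(x * y for x, y in zip(row, v)) for row in a]
    return math.sqrt(sum(x * x for x in av))


def stable_rank(a: Matrix) -> float:
    """||A||_F^2 / ||A||_2^2: a proxy for how many directions A uses."""
    s = spectral_norm(a)
    if s == 0.0:
        return 0.0
    return sum(x * x for row in a for x in row) / (s * s)
```

Logged once per chunk for the accumulated weight delta, a stable rank that stays near 1 would suggest the updates keep writing along a single direction, while a gradient norm that grows with context length would be an early sign of instability.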

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

The abstract presents the In-Place TTT framework as a practical design choice: using the final projection matrix of MLP blocks as fast weights for drop-in compatibility, and replacing the generic reconstruction objective with a next-token-prediction-aligned objective described as theoretically grounded. No equations are shown in the provided text, and no self-citations are invoked to justify the core choices. The experimental results on the 4B model and the from-scratch pretraining comparisons are presented as validation, not as the basis for the design. Therefore, there is no reduction of predictions to inputs by construction, and the derivation chain appears self-contained against external benchmarks such as standard TTT methods.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework relies on the assumption that MLP final projections can serve as effective fast weights and that the new objective aligns with autoregressive modeling without introducing instability. No explicit free parameters are named in the abstract, but chunk size and update learning rate are implied implementation choices.

free parameters (2)

update learning rate: likely tuned for the test-time adaptation step, though not quantified in the abstract.
chunk size: determines the granularity of the efficient update mechanism for long contexts.
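
To see why chunk size matters as a free parameter, it helps to count how many fast-weight updates a context triggers. The helper below is a hypothetical illustration: the function name, the one-update-per-chunk assumption, and any chunk size plugged into it are not from the paper.

```python
def num_fast_weight_updates(context_len: int, chunk_size: int) -> int:
    """Number of chunks (and so updates, assuming one update per chunk)."""
    if context_len < 0 or chunk_size <= 0:
        raise ValueError("context_len must be >= 0 and chunk_size must be > 0")
    # Integer ceiling division; the last chunk may be shorter than chunk_size.
    return -(-context_len // chunk_size)
```

For example, if 128k is read as 131072 tokens and the chunk size is set to a hypothetical 2048, the context is processed in 64 chunks. Smaller chunks mean more frequent adaptation but more sequential steps; larger chunks mean fewer, coarser updates.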

axioms (2)

domain assumption: The final projection matrix in MLP blocks can be updated independently without affecting model stability or requiring changes to other components. Invoked to justify the drop-in nature of the method.
domain assumption: A next-token-prediction-aligned objective is superior to generic reconstruction for test-time adaptation in autoregressive LLMs. Central to replacing the standard TTT objective.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights... replace TTT's generic reconstruction objective with a tailored... Next-Token-Prediction task

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
chunk-wise update rule... context parallelism... 8-tick period absent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL 2026-05 conditional novelty 7.0

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
