Recognition: 2 theorem links · Lean Theorem
In-Place Test-Time Training
Pith reviewed 2026-05-10 19:00 UTC · model grok-4.3
The pith
In-Place Test-Time Training endows large language models with the ability to adapt weights at inference time by updating the final projection matrices of their MLP blocks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a drop-in enhancement for LLMs without costly retraining from scratch. It replaces TTT's generic reconstruction objective with a tailored objective aligned with next-token prediction. Combined with an efficient chunk-wise update mechanism, this produces a scalable algorithm. The reported experiments show superior performance for a 4B-parameter model on tasks with contexts up to 128k when the method is used as an in-place enhancement, and consistent gains over competitive TTT-related approaches when the model is pretrained from scratch.
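One generic way to write a chunk-wise fast-weight step against a next-token loss is sketched below. The notation is assumed for illustration and is not taken from the paper: W is the final MLP projection treated as fast weights, \theta the remaining frozen parameters, C_c the set of positions in chunk c, and \eta the update learning rate. The paper's tailored objective is not reproduced on this page and may differ from this plain cross-entropy form.

```latex
\begin{align}
\mathcal{L}_c(W) &= -\frac{1}{|C_c|} \sum_{t \in C_c}
    \log p_{\theta, W}\!\left(x_{t+1} \mid x_{\le t}\right), \\
W_{c+1} &= W_c - \eta \, \nabla_W \mathcal{L}_c(W_c).
\end{align}
```

Under this reading, the tokens of chunk c+1 are processed with W_{c+1}, so each chunk only sees weights adapted on earlier chunks.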
What carries the argument
The final projection matrix of MLP blocks as fast weights, updated with a next-token-prediction objective through chunk-wise mechanisms.
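A minimal sketch of the chunk-wise idea, under loud assumptions: a toy two-layer MLP with a ReLU, and a squared error against the next token's embedding as a stand-in for the paper's next-token-aligned objective, which this page does not spell out. The function name `chunkwise_fast_weight_pass` and all parameter names are hypothetical, not the authors' code.

```python
import numpy as np

def chunkwise_fast_weight_pass(x, w_up, w_down, chunk_size, lr):
    """Process a sequence chunk by chunk, updating only the final projection.

    x: (T, d) token embeddings; w_up: (d, m); w_down: (m, d).
    Returns per-position outputs and the adapted copy of w_down.
    """
    w_fast = w_down.copy()            # the pretrained weights are left untouched
    outputs = np.zeros_like(x)
    T = x.shape[0]
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        h = np.maximum(x[start:end] @ w_up, 0.0)   # hidden activations (ReLU)
        outputs[start:end] = h @ w_fast            # uses weights from earlier chunks only
        # Assumed surrogate target: the embedding of the next token.
        # The last position of the sequence has no successor and is skipped.
        n = min(end, T - 1) - start
        if n <= 0:
            continue
        err = h[:n] @ w_fast - x[start + 1:start + 1 + n]
        grad = h[:n].T @ err / n                   # gradient of 0.5 * mean squared error
        w_fast -= lr * grad                        # one gradient step per chunk
    return outputs, w_fast

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
w_up = rng.normal(size=(4, 8))
w_down = rng.normal(size=(8, 4))
outputs, w_adapted = chunkwise_fast_weight_pass(x, w_up, w_down, chunk_size=4, lr=0.01)
```

Here `chunk_size` and `lr` play the role of the two free parameters this review lists (chunk size and update learning rate). Only the copy of `w_down` changes, which mirrors the drop-in claim that the rest of the model stays fixed.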
Load-bearing premise
That adapting only the final projection matrices inside the MLP blocks using the new next-token objective produces stable updates that improve performance without degrading the model or needing other changes.
What would settle it
A direct comparison where a model with In-Place TTT fails to improve or worsens on long-context benchmarks relative to its non-adapting counterpart would falsify the central effectiveness claim.
Original abstract
The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces In-Place Test-Time Training (In-Place TTT) as a drop-in framework for LLMs that adapts only the final projection matrix within each MLP block as fast weights during inference. It replaces generic TTT reconstruction objectives with a new next-token-prediction-aligned objective and uses chunk-wise updates for scalability with context parallelism. Experiments claim that this enables a 4B model to outperform baselines on tasks with up to 128k contexts as an in-place enhancement, and that pretraining from scratch with In-Place TTT consistently beats competitive TTT methods, supported by ablations on design choices.
Significance. If the empirical results and stability claims hold under the restricted adaptation, this could meaningfully advance practical test-time adaptation for existing LLMs by avoiding architectural changes or full retraining. The emphasis on a theoretically aligned objective and compatibility with long contexts addresses real barriers in the TTT literature for language modeling. The drop-in property and reported outperformance on 128k contexts would be notable strengths if the limited fast-weight capacity proves sufficient without side effects.
major comments (2)
- [§3] §3 (Method) and Eq. for the new objective: the claim that the objective is 'theoretically-grounded' and independent of experimental outcomes is not demonstrated in the provided description; the derivation must be shown explicitly to confirm it does not reduce to a fitted quantity or introduce circularity with the reported gains.
- [Experiments] Experiments section (4B model results on 128k contexts): the central claim that restricting updates to only the final MLP projection matrix produces stable, effective adaptation without degrading the rest of the model or requiring changes rests on unverified assumptions about capacity; additional controls or analysis are needed to show why this restriction suffices rather than leaking or underfitting on long contexts.
minor comments (2)
- [Abstract] Abstract: notation for 'fast weights' and 'chunk size' should be defined on first use for clarity.
- [§3] The description of 'context parallelism' compatibility would benefit from a brief diagram or pseudocode in the methods to illustrate the chunk-wise mechanism.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and analyses.
Point-by-point responses
-
Referee: [§3] §3 (Method) and Eq. for the new objective: the claim that the objective is 'theoretically-grounded' and independent of experimental outcomes is not demonstrated in the provided description; the derivation must be shown explicitly to confirm it does not reduce to a fitted quantity or introduce circularity with the reported gains.
Authors: We appreciate this observation. The objective is obtained by replacing the generic reconstruction loss of prior TTT methods with the standard autoregressive cross-entropy loss applied to the next token, where the loss is evaluated after the in-place update of the fast weights. This construction follows directly from the next-token-prediction objective that defines language-model training and does not depend on any post-hoc fitting to the reported results. To make the grounding fully explicit and to rule out any appearance of circularity, we will insert the complete derivation (including the precise loss expression and the justification for its independence from experimental outcomes) into the revised Section 3.
Revision: yes
-
Referee: [Experiments] Experiments section (4B model results on 128k contexts): the central claim that restricting updates to only the final MLP projection matrix produces stable, effective adaptation without degrading the rest of the model or requiring changes rests on unverified assumptions about capacity; additional controls or analysis are needed to show why this restriction suffices rather than leaking or underfitting on long contexts.
Authors: We agree that stronger evidence for the sufficiency of the restricted adaptation is warranted. The final projection matrix is chosen because it is the linear transformation that produces the MLP block output after the non-linearity, thereby providing a compact yet expressive site for fast-weight updates while preserving the rest of the model unchanged. The 4B-model experiments already demonstrate stable gains up to 128k contexts without degradation on shorter contexts or unrelated tasks, which is consistent with adequate capacity. Nevertheless, we will add in the revised experiments section (i) an ablation comparing adaptation of the final projection versus other matrices inside the MLP block and (ii) a capacity analysis that tracks the effective rank and gradient norms of the updated weights across long contexts, thereby directly addressing concerns about leakage or underfitting.
Revision: yes
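The capacity analysis promised in the second response could be prototyped along the following lines. This is a sketch under assumptions: "effective rank" has several definitions and the rebuttal does not say which one is meant, so the entropy-based variant (the exponential of the Shannon entropy of the normalized singular values) is used here. The helper names `effective_rank` and `update_diagnostics` are hypothetical.

```python
import numpy as np

def effective_rank(delta, eps=1e-12):
    """Entropy-based effective rank of a matrix (an assumed definition)."""
    s = np.linalg.svd(delta, compute_uv=False)   # singular values only
    total = s.sum()
    if total <= eps:
        return 0.0                               # no update, no rank
    p = s / total                                # normalize to a distribution
    p = p[p > eps]                               # drop numerically zero entries
    return float(np.exp(-(p * np.log(p)).sum()))

def update_diagnostics(w_initial, w_snapshots, grads):
    """Effective rank of the accumulated update and gradient norm per step.

    w_snapshots: fast weights saved after each chunk.
    grads: the gradient applied at each chunk (same order).
    """
    rows = []
    for w, g in zip(w_snapshots, grads):
        rows.append({
            "effective_rank": effective_rank(w - w_initial),
            "grad_norm": float(np.linalg.norm(g)),   # Frobenius norm
        })
    return rows
```

Tracked over a long context, a flat or collapsing effective rank would hint at underfitting, while growing gradient norms would hint at instability. That reading is an interpretation, not a result from the paper.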
Circularity Check
No significant circularity detected in the derivation
full rationale
The abstract presents the In-Place TTT framework as a practical design choice: using the final projection matrix of MLP blocks as fast weights for drop-in compatibility, and replacing the generic reconstruction objective with a next-token-prediction-aligned objective described as theoretically grounded. No equations are shown in the provided text, and no self-citations are invoked to justify the core choices. The experimental results on the 4B model and the pretraining comparisons are presented as validation, not as the basis for the design. Therefore, there is no reduction of predictions to inputs by construction, and the derivation chain appears self-contained against external benchmarks such as standard TTT methods.
Axiom & Free-Parameter Ledger
free parameters (2)
- update learning rate
- chunk size
axioms (2)
- domain assumption: The final projection matrix in MLP blocks can be updated independently without affecting model stability or requiring changes to other components.
- domain assumption: A next-token-prediction-aligned objective is superior to generic reconstruction for test-time adaptation in autoregressive LLMs.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights... replace TTT's generic reconstruction objective with a tailored... Next-Token-Prediction task
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
chunk-wise update rule... context parallelism... 8-tick period absent
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.