Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

Guangyi Chen; Kun Zhang; Lingjing Kong; Shaoan Xie; Xiangchen Song; Xinshuai Dong; Zhenhao Chen

arxiv: 2607.00368 · v1 · pith:YNXGBHPAnew · submitted 2026-07-01 · 💻 cs.CL

Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

Xiangchen Song , Zhenhao Chen , Lingjing Kong , Shaoan Xie , Xinshuai Dong , Guangyi Chen , Kun Zhang This is my paper

Pith reviewed 2026-07-02 13:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords test-time trainingLLM memory evaluationbehavioral metricsdeployment claimsLoRA updatesnonce factsperplexity proxiesrecall gaps

0 comments

The pith

Proxy metrics like loss reduction fail to produce actual recall in LLM test-time training for deployment memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a behavioral evaluation framework to align claims about deployed assistant memory and personalization in LLM test-time training with supporting evidence. Current evaluations rely on local proxy metrics such as perplexity or future-token loss, which match claims about stream or domain adaptation but provide weaker support for behavioral outcomes like later recall after the original context is removed. The framework consists of a claim-calibrated evidence ladder distinguishing adaptation types and an evaluation protocol that uses matched explicit-memory baselines plus mutually exclusive failure categories. Validation in a sparse nonce-fact setting shows one-step LoRA updates reduce support and answer loss across Qwen3 scales yet leave generated free-form recall at zero. This exposes a gap between proxy gains and the behavioral evidence required for deployment-memory claims.

Core claim

In a sparse nonce-fact setting, one-step LoRA updates lower support and answer loss across three Qwen3 model scales while generated free-form recall stays at zero, exposing a measurable gap between proxy improvement and deployment behavior. The framework calibrates TTT memory claims to the evidence that supports them through an evidence ladder separating stream/domain adaptation, bridge internalization, and deployment-time behavioral learning, plus a protocol with matched baselines and failure categories.

What carries the argument

The claim-calibrated evidence ladder that separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning, together with the evaluation protocol using matched explicit-memory baselines and mutually exclusive failure categories.

If this is right

TTT papers claiming deployment memory must report behavioral metrics such as recall and paraphrase robustness rather than proxies alone.
Existing TTT methods that only reduce loss may not support claims of post-deployment learning or personalization.
The framework supplies a concrete standard for authors and evaluators to match TTT memory claims to the evidence presented.
One-step updates in current setups achieve surface adaptation but not the internalization needed for later behavioral use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be applied to test whether other adaptation techniques beyond one-step LoRA achieve behavioral recall.
It points to a need for TTT methods explicitly designed to bridge context into durable, retrievable internal representations.
Similar gaps may appear in continual learning settings where models process new facts without later free-form access.

Load-bearing premise

The chosen nonce-fact diagnostic and mutually exclusive failure categories fully represent the requirements for real deployment-memory claims without omitting important real-world usage patterns.

What would settle it

An experiment in the same sparse nonce-fact setting where one-step LoRA updates produce measurable increases in generated free-form recall after context removal, while all other conditions remain fixed.

Figures

Figures reproduced from arXiv: 2607.00368 by Guangyi Chen, Kun Zhang, Lingjing Kong, Shaoan Xie, Xiangchen Song, Xinshuai Dong, Zhenhao Chen.

**Figure 2.** Figure 2: Three claim levels used to calibrate TTT claims. The level records the strongest evidence [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity, future-token loss, long-context performance, or reward. These metrics are well matched to claims about stream adaptation, domain adaptation, context compression, and reward-backed test-time improvement. They are weaker evidence, however, for a capability that TTT results are increasingly used to motivate: deployed assistant memory, personalization, or sparse post-deployment learning, which instead requires behavioral evidence such as later recall, paraphrase robustness, retention, locality, conflict handling, and use in downstream actions after the original support context is removed. We introduce a behavioral evaluation framework that calibrates TTT memory claims to the evidence that supports them. It has two components: a claim-calibrated evidence ladder that separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning; and an evaluation protocol with matched explicit-memory baselines and mutually exclusive failure categories. We validate the framework by auditing recent TTT and memory-adjacent work and by instantiating it as a controlled diagnostic in which, in a sparse nonce-fact setting, one-step LoRA updates lower support and answer loss across three Qwen3 model scales while generated free-form recall stays at zero, exposing a measurable gap between proxy improvement and deployment behavior. The framework gives authors and evaluators a concrete standard for aligning TTT memory claims with the evidence actually reported.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete framework to tie TTT memory claims to behavioral tests rather than proxies, but its nonce-fact diagnostic leaves out conflicts and integration.

read the letter

The paper's main contribution is a framework that requires behavioral evidence for claims about deployed memory in test-time training, instead of relying on perplexity or loss alone. They show in a nonce-fact setup that one-step LoRA can improve support and answer loss across model scales but leaves free-form recall at zero.

What is new is the evidence ladder that distinguishes stream adaptation from bridge internalization and behavioral learning, together with the protocol that includes matched explicit-memory baselines and mutually exclusive failure categories. They use it to audit recent papers and run their own diagnostic.

It does a good job of explaining why current proxy metrics are insufficient for personalization claims and of offering a structured way to align claims with evidence.

The limitation is that their diagnostic uses only sparse, non-conflicting facts. Deployment memory often requires handling conflicts or integrating new info with existing knowledge, so the framework might not fully cover those cases yet. Without more details on how the failure categories are defined and the stats, it's hard to judge how robust the protocol is.

This is for people in the TTT and LLM adaptation area who care about evaluation standards. A reader working on memory-related methods would find the ladder useful to apply or critique.

I think it should go to peer review. The issue with proxy metrics is worth addressing, and the proposal is specific enough for referees to work with.

Referee Report

1 major / 2 minor

Summary. The paper introduces a behavioral evaluation framework for deployment-memory claims in LLM test-time training (TTT). It consists of a claim-calibrated evidence ladder separating stream/domain adaptation, bridge internalization, and deployment-time behavioral learning, plus an evaluation protocol using matched explicit-memory baselines and mutually exclusive failure categories. The framework is validated by auditing recent TTT work and via a controlled diagnostic: in a sparse nonce-fact setting, one-step LoRA updates lower support and answer loss across three Qwen3 scales, yet generated free-form recall remains zero, exposing a gap between proxy metrics and deployment behavior.

Significance. If the framework holds, it supplies a concrete standard for aligning TTT memory claims with supporting evidence, addressing the mismatch between common proxy metrics (perplexity, loss) and behavioral requirements such as recall, retention, and conflict handling. The empirical demonstration of the proxy-deployment gap in the nonce-fact diagnostic provides a falsifiable prediction that strengthens the contribution and could improve evaluation rigor in the field.

major comments (1)

[§4] §4 (diagnostic instantiation): The central empirical claim that the nonce-fact setting plus mutually exclusive failure categories adequately calibrates the evidence ladder for deployment-memory claims rests on the assumption that isolated, non-conflicting facts suffice; however, the introduction lists conflict handling and incremental personalization as required behavioral evidence, and these are not probed, risking an incomplete separation of claim types.

minor comments (2)

[§3] The abstract and §3 mention 'matched explicit-memory baselines' but do not specify how baselines are constructed or matched in the diagnostic; adding a table or subsection detailing the exact baseline implementations would improve reproducibility.
[Figure 1] Figure 1 (evidence ladder diagram) uses terms like 'bridge internalization' that are defined in text but could benefit from an explicit legend or example row to clarify distinctions for readers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review. We respond to the single major comment below.

read point-by-point responses

Referee: [§4] §4 (diagnostic instantiation): The central empirical claim that the nonce-fact setting plus mutually exclusive failure categories adequately calibrates the evidence ladder for deployment-memory claims rests on the assumption that isolated, non-conflicting facts suffice; however, the introduction lists conflict handling and incremental personalization as required behavioral evidence, and these are not probed, risking an incomplete separation of claim types.

Authors: The nonce-fact diagnostic was selected as a minimal non-conflicting case precisely to isolate whether loss reductions produce any behavioral recall once support is removed. This tests the lowest bar for the deployment-time behavioral learning rung of the ladder. The introduction does identify conflict handling and incremental personalization as necessary behavioral evidence for full deployment-memory claims. Because the results already show zero free-form recall despite loss improvement, stronger claims requiring conflict resolution are already unsupported by the evidence. We will add a paragraph in §4 clarifying that the diagnostic functions as a lower-bound test and that conflict scenarios constitute a natural next instantiation of the protocol. This is a clarification rather than a change to the empirical results or framework structure. revision: partial

Circularity Check

0 steps flagged

No circularity; new conceptual framework and diagnostic introduced without reduction to inputs

full rationale

The paper defines a claim-calibrated evidence ladder and evaluation protocol as new constructs, then validates them via auditing of prior TTT work and a controlled nonce-fact diagnostic experiment (one-step LoRA on Qwen3 models). No load-bearing step reduces a claimed result to a fitted parameter, self-citation chain, or definitional equivalence; the central separation of stream adaptation from deployment-time behavioral learning is instantiated empirically rather than derived from prior author equations or ansatzes. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper rests on the domain assumption that behavioral evidence is necessary for deployment-memory claims and introduces two new conceptual entities (the evidence ladder and the evaluation protocol) without independent empirical grounding beyond the single diagnostic.

axioms (1)

domain assumption Proxy metrics such as perplexity are weaker evidence than behavioral tests for deployment-memory claims
Stated directly in the abstract as the motivation for the framework.

invented entities (2)

claim-calibrated evidence ladder no independent evidence
purpose: Separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning
New conceptual structure introduced by the paper
evaluation protocol with matched explicit-memory baselines and mutually exclusive failure categories no independent evidence
purpose: Provides concrete testing method for the ladder
New protocol introduced by the paper

pith-pipeline@v0.9.1-grok · 5821 in / 1319 out tokens · 30981 ms · 2026-07-02T13:42:16.766496+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 7 canonical work pages · 3 internal anchors

[1]

Self-improving llm agents at test-time

Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, and Gokhan Tur. Self-improving llm agents at test-time. 2025

2025
[2]

Memorybench: A benchmark for memory and continual learning in LLM systems, 2026

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun LIU. Memorybench: A benchmark for memory and continual learning in LLM systems, 2026

2026
[3]

Test-time training undermines existing safety guardrails

Simone Antonelli, Mohammad Sadegh Akhondzadeh, and Aleksandar Bojchevski. Test-time training undermines existing safety guardrails. InICLR 2026 Workshop on Trustworthy AI, 2026

2026
[4]

Let’s (not) just put things in context: Test-time training for long-context LLMs

Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Fnu Devvrit, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Kale, and Samy Jelassi. Let’s (not) just put things in context: Test-time training for long-context LLMs. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[5]

Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. ATLAS: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

work page arXiv 2025
[6]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

PERK: Long-context reasoning as parameter-efficient test-time learning

Zeming Chen, Angelika Romanou, Gail Weiss, and Antoine Bosselut. PERK: Long-context reasoning as parameter-efficient test-time learning. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[8]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

In-place test-time training

Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Wenhao Huang, Di He, and Tianle Cai. In-place test-time training. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[10]

Model editing at scale leads to gradual and catastrophic forgetting

Akshat Gupta, Anurag Rao, and Gopala Anumanchipalli. Model editing at scale leads to gradual and catastrophic forgetting. InFindings of the Association for Computational Linguistics: ACL 2024, pages 15202–15232, 2024

2024
[11]

Test-time training on nearest neighbors for large language models

Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[12]

Detecting edit failures in large language models: An improved specificity benchmark

Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, and Fazl Barez. Detecting edit failures in large language models: An improved specificity benchmark. In 10 Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 11548–11559, Toronto, Canada, July 2023. As...

2023
[13]

Test-time learning for large language models

Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuan- qing Li, and Mingkui Tan. Test-time learning for large language models. InForty-second International Conference on Machine Learning, 2025

2025
[14]

Evaluating memory in LLM agents via in- cremental multi-turn interactions

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via in- cremental multi-turn interactions. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[15]

Dynamic evaluation of neural sequence models

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. InInternational Conference on Machine Learning, pages 2766–2775. PMLR, 2018

2018
[16]

Should we really edit language models? on the evaluation of edited language models.Advances in Neural Information Processing Systems, 37:30850–30885, 2024

Qi Li, Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Xinglin Pan, and Xiaowen Chu. Should we really edit language models? on the evaluation of edited language models.Advances in Neural Information Processing Systems, 37:30850–30885, 2024

2024
[17]

Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

work page arXiv 2026
[18]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024
[19]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY , USA, 2022. Curran Associates Inc

2022
[20]

Mass- editing memory in a transformer

Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, and David Bau. Mass- editing memory in a transformer. InThe Eleventh International Conference on Learning Representations, 2023

2023
[21]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonza- lez. Memgpt: towards llms as operating systems. 2023

2023
[22]

Amal Rannen-Triki, Jorg Bornschein, Razvan Pascanu, Marcus Hutter, Andras György, Alexan- dre Galashov, Yee Whye Teh, and Michalis K. Titsias. Revisiting dynamic evaluation: Online adaptation for large language models.arXiv preprint arXiv:2403.01518, 2024

work page arXiv 2024
[23]

Long-form evaluation of model editing

Domenic Rosati, Robie Gonzales, Jinkun Chen, Xuemin Yu, Yahya Kayani, Frank Rudzicz, and Hassan Sajjad. Long-form evaluation of model editing. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3749–3780, 2024

2024
[24]

Mem2actbench: A benchmark for evaluating long-term memory utilization in task-oriented autonomous agents, 2026

Yiting Shen, Kun Li, Wei Zhou, and Songlin Hu. Mem2actbench: A benchmark for evaluating long-term memory utilization in task-oriented autonomous agents, 2026

2026
[25]

Learning to (learn at test time): RNNs with expressive hidden states

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InForty-second International Conference on Machine Learning, 2025

2025
[26]

Test-time training with self-supervision for generalization under distribution shifts

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning, pages 9229–9248. PMLR, 2020. 11

2020
[27]

Dynamic cheatsheet: Test-time learning with adaptive memory

Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, Rabat, Mo...

2026
[28]

End-to-end test-time training for long context, 2025

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, and Yu Sun. End-to-end test-time training for long context, 2025

2025
[29]

Augmenting language models with long-term memory

Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. InThirty-seventh Conference on Neural Information Processing Systems, 2023

2023
[30]

MEMORYLLM: Towards self-updatable large language models

Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian McAuley. MEMORYLLM: Towards self-updatable large language models. InForty-first International Conference on Machine Learning, 2024

2024
[31]

Long- memeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[32]

You only need 4 extra tokens: Synergistic test-time adaptation for LLMs, 2026

Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, and Hui Xiong. You only need 4 extra tokens: Synergistic test-time adaptation for LLMs, 2026

2026
[33]

The mirage of model editing: Revisiting evaluation in the wild

Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, and Xueqi Cheng. The mirage of model editing: Revisiting evaluation in the wild. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15336–15354, Vienna, Austria, July 2025. Association for Computational Linguistics

2025
[34]

Learning to discover at test time, 2026

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to discover at test time, 2026

2026
[35]

Freeman, and Hao Tan

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-time training done right. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[36]

Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization.arXiv preprint arXiv:2603.25973, 2026

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, and Philip S Yu. Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization.arXiv preprint arXiv:2603.25973, 2026

work page arXiv 2026
[37]

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Zhixin Zhang, Shabo Zhang, Chengcan Wu, Zeming Wei, and Meng Sun. Absorber llm: Harnessing causal synchronization for test-time training.arXiv preprint arXiv:2604.20915, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024

2024
[39]

MQuAKE: Assessing knowledge editing in language models via multi-hop questions

Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2023
[40]

TTRL: Test-time reinforcement learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. TTRL: Test-time reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[41]

Dynamic evaluation

Adam Zweiger, Jyothish Pari, Han Guo, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12 A Evidence migration patterns This appendix makes the evidence-migration concern concrete without treating the cited papers as overclaiming. The migration risk is interpr...

2025

[1] [1]

Self-improving llm agents at test-time

Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, and Gokhan Tur. Self-improving llm agents at test-time. 2025

2025

[2] [2]

Memorybench: A benchmark for memory and continual learning in LLM systems, 2026

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun LIU. Memorybench: A benchmark for memory and continual learning in LLM systems, 2026

2026

[3] [3]

Test-time training undermines existing safety guardrails

Simone Antonelli, Mohammad Sadegh Akhondzadeh, and Aleksandar Bojchevski. Test-time training undermines existing safety guardrails. InICLR 2026 Workshop on Trustworthy AI, 2026

2026

[4] [4]

Let’s (not) just put things in context: Test-time training for long-context LLMs

Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri, Fnu Devvrit, David Brandfonbrener, David Alvarez-Melis, Prajjwal Bhargava, Mihir Kale, and Samy Jelassi. Let’s (not) just put things in context: Test-time training for long-context LLMs. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[5] [5]

Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. ATLAS: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

work page arXiv 2025

[6] [6]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

PERK: Long-context reasoning as parameter-efficient test-time learning

Zeming Chen, Angelika Romanou, Gail Weiss, and Antoine Bosselut. PERK: Long-context reasoning as parameter-efficient test-time learning. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[8] [8]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

In-place test-time training

Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Wenhao Huang, Di He, and Tianle Cai. In-place test-time training. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[10] [10]

Model editing at scale leads to gradual and catastrophic forgetting

Akshat Gupta, Anurag Rao, and Gopala Anumanchipalli. Model editing at scale leads to gradual and catastrophic forgetting. InFindings of the Association for Computational Linguistics: ACL 2024, pages 15202–15232, 2024

2024

[11] [11]

Test-time training on nearest neighbors for large language models

Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024

[12] [12]

Detecting edit failures in large language models: An improved specificity benchmark

Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, and Fazl Barez. Detecting edit failures in large language models: An improved specificity benchmark. In 10 Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 11548–11559, Toronto, Canada, July 2023. As...

2023

[13] [13]

Test-time learning for large language models

Jinwu Hu, Zitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuan- qing Li, and Mingkui Tan. Test-time learning for large language models. InForty-second International Conference on Machine Learning, 2025

2025

[14] [14]

Evaluating memory in LLM agents via in- cremental multi-turn interactions

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via in- cremental multi-turn interactions. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[15] [15]

Dynamic evaluation of neural sequence models

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. InInternational Conference on Machine Learning, pages 2766–2775. PMLR, 2018

2018

[16] [16]

Should we really edit language models? on the evaluation of edited language models.Advances in Neural Information Processing Systems, 37:30850–30885, 2024

Qi Li, Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Xinglin Pan, and Xiaowen Chu. Should we really edit language models? on the evaluation of edited language models.Advances in Neural Information Processing Systems, 37:30850–30885, 2024

2024

[17] [17]

Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

work page arXiv 2026

[18] [18]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024

[19] [19]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY , USA, 2022. Curran Associates Inc

2022

[20] [20]

Mass- editing memory in a transformer

Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, and David Bau. Mass- editing memory in a transformer. InThe Eleventh International Conference on Learning Representations, 2023

2023

[21] [21]

Memgpt: towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonza- lez. Memgpt: towards llms as operating systems. 2023

2023

[22] [22]

Amal Rannen-Triki, Jorg Bornschein, Razvan Pascanu, Marcus Hutter, Andras György, Alexan- dre Galashov, Yee Whye Teh, and Michalis K. Titsias. Revisiting dynamic evaluation: Online adaptation for large language models.arXiv preprint arXiv:2403.01518, 2024

work page arXiv 2024

[23] [23]

Long-form evaluation of model editing

Domenic Rosati, Robie Gonzales, Jinkun Chen, Xuemin Yu, Yahya Kayani, Frank Rudzicz, and Hassan Sajjad. Long-form evaluation of model editing. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3749–3780, 2024

2024

[24] [24]

Mem2actbench: A benchmark for evaluating long-term memory utilization in task-oriented autonomous agents, 2026

Yiting Shen, Kun Li, Wei Zhou, and Songlin Hu. Mem2actbench: A benchmark for evaluating long-term memory utilization in task-oriented autonomous agents, 2026

2026

[25] [25]

Learning to (learn at test time): RNNs with expressive hidden states

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. InForty-second International Conference on Machine Learning, 2025

2025

[26] [26]

Test-time training with self-supervision for generalization under distribution shifts

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning, pages 9229–9248. PMLR, 2020. 11

2020

[27] [27]

Dynamic cheatsheet: Test-time learning with adaptive memory

Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, Rabat, Mo...

2026

[28] [28]

End-to-end test-time training for long context, 2025

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, and Yu Sun. End-to-end test-time training for long context, 2025

2025

[29] [29]

Augmenting language models with long-term memory

Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. InThirty-seventh Conference on Neural Information Processing Systems, 2023

2023

[30] [30]

MEMORYLLM: Towards self-updatable large language models

Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian McAuley. MEMORYLLM: Towards self-updatable large language models. InForty-first International Conference on Machine Learning, 2024

2024

[31] [31]

Long- memeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[32] [32]

You only need 4 extra tokens: Synergistic test-time adaptation for LLMs, 2026

Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, and Hui Xiong. You only need 4 extra tokens: Synergistic test-time adaptation for LLMs, 2026

2026

[33] [33]

The mirage of model editing: Revisiting evaluation in the wild

Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, and Xueqi Cheng. The mirage of model editing: Revisiting evaluation in the wild. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15336–15354, Vienna, Austria, July 2025. Association for Computational Linguistics

2025

[34] [34]

Learning to discover at test time, 2026

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to discover at test time, 2026

2026

[35] [35]

Freeman, and Hao Tan

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, and Hao Tan. Test-time training done right. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[36] [36]

Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization.arXiv preprint arXiv:2603.25973, 2026

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, and Philip S Yu. Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization.arXiv preprint arXiv:2603.25973, 2026

work page arXiv 2026

[37] [37]

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Zhixin Zhang, Shabo Zhang, Chengcan Wu, Zeming Wei, and Meng Sun. Absorber llm: Harnessing causal synchronization for test-time training.arXiv preprint arXiv:2604.20915, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024

2024

[39] [39]

MQuAKE: Assessing knowledge editing in language models via multi-hop questions

Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2023

[40] [40]

TTRL: Test-time reinforcement learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. TTRL: Test-time reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[41] [41]

Dynamic evaluation

Adam Zweiger, Jyothish Pari, Han Guo, Yoon Kim, and Pulkit Agrawal. Self-adapting language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 12 A Evidence migration patterns This appendix makes the evidence-migration concern concrete without treating the cited papers as overclaiming. The migration risk is interpr...

2025