ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

Chongrui Ye; Ge Liu; Haozhen Zhang; Jiaxuan You; Jingjun Xu; Tao Feng; Tianyang Luo; Xueqiang Xu

arxiv: 2605.30690 · v1 · pith:4XJ5KD2Dnew · submitted 2026-05-29 · 💻 cs.CL

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu, Xueqiang Xu, Haozhen Zhang, Ge Liu, Jiaxuan You

Pith reviewed 2026-06-28 23:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords: latent memory, LLM agents, adaptive retrieval, memory augmentation, policy optimization, embodied agents, question answering, token efficiency

The pith

Treating latent memory as a learnable elastic resource lets LLM agents adaptively retrieve and budget memory items to improve task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ElasticMem to fix the mismatch between fixed memory allocation and query-dependent needs in LLM agents. It constructs a latent memory bank and trains a policy that retrieves items from the agent's hidden state while assigning each a variable budget. The selected states are added as soft tokens and the whole retrieval-plus-allocation process is optimized end-to-end with group-relative policy optimization on downstream rewards. Experiments on MemorySuite show large gains in QA accuracy and embodied success rates together with lower token usage. Readers would care because current fixed-memory methods either inflate context costs or fail to reuse experience effectively over long interactions.

Core claim

ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, this produces weighted average QA accuracy gains of 26.2% and 24.6% and ALFWorld success rate gains of 66.3% and 27.2% over the strongest baselines while recording the lowest ALFWorld token cost.
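The retrieve-then-budget-then-inject pipeline described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the `MemoryItem` layout, the dot-product scoring rule, the top-k cutoff, and the "keep the first `b` cached vectors" truncation are all hypothetical choices, and the budgets are passed in by hand rather than produced by a learned policy.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class MemoryItem:
    """One entry of a read-only latent memory bank (hypothetical layout)."""
    key: Vector          # retrieval key
    cache: List[Vector]  # latent content cache, one vector per latent slot


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(hidden_state: Vector, bank: List[MemoryItem], top_k: int) -> List[int]:
    """Rank bank entries by dot product with the hidden state (assumed scoring rule)."""
    scores = [dot(hidden_state, item.key) for item in bank]
    ranked = sorted(range(len(bank)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def build_soft_tokens(bank: List[MemoryItem], indices: List[int],
                      budgets: List[int]) -> List[Vector]:
    """Keep the first `b` cached vectors of each retrieved item (assumed truncation rule)."""
    soft_tokens: List[Vector] = []
    for i, b in zip(indices, budgets):
        soft_tokens.extend(bank[i].cache[:b])
    return soft_tokens


# Toy bank with 2-dimensional vectors; all values are placeholders.
bank = [
    MemoryItem(key=[1.0, 0.0], cache=[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]),
    MemoryItem(key=[0.0, 1.0], cache=[[4.0, 4.0], [5.0, 5.0]]),
    MemoryItem(key=[1.0, 1.0], cache=[[6.0, 6.0]]),
]
hidden_state = [0.9, 0.2]

indices = retrieve(hidden_state, bank, top_k=2)
soft_tokens = build_soft_tokens(bank, indices, budgets=[1, 2])
print(indices, len(soft_tokens))  # prints [2, 0] 3
```

In this toy run the two retrieved items contribute one and two latent vectors respectively, so the number of soft tokens depends on the per-item budgets rather than on a fixed memory size.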

What carries the argument

The learned policy for adaptive retrieval from hidden states combined with variable latent budget allocation, optimized via group-relative policy optimization on task rewards.
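For readers unfamiliar with group-relative policy optimization, the core idea is that each sampled rollout is scored against the other rollouts drawn for the same query, so no separate value network is needed. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` constant are assumptions of this illustration, and the paper's exact objective is not reproduced here.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one query with placeholder task rewards (1 = success, 0 = failure).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat their group average receive a positive advantage and the rest a negative one; when every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.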

If this is right

Adaptive retrieval from hidden states prioritizes useful evidence and transferable plans beyond rigid cosine similarity.
Variable latent budget allocation reduces token overhead while raising success rates on memory-intensive QA and control tasks.
The same elastic mechanism produces gains on both 3B and 7B backbones, indicating the policy is not tied to a single model scale.
Ablations confirm that both the adaptive retrieval component and the elastic budget component contribute to the observed performance lift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the policy truly generalizes, agents could maintain coherent state over interaction lengths that exceed current context windows without proportional cost increases.
The same reward-driven elasticity could be applied to other internal resources such as KV-cache eviction or attention span.
Success on ALFWorld suggests the approach may transfer to other embodied or multi-step planning domains that require reuse of past plans.

Load-bearing premise

Optimizing memory retrieval and budget allocation with downstream task rewards produces a policy that generalizes across tasks rather than fitting only the training distribution.

What would settle it

Training the policy on MemorySuite tasks and then testing it on a new memory-intensive task outside that suite, such as a different long-horizon embodied environment, and observing no accuracy or success improvement would falsify the claim of a generalizable elastic memory policy.

Figures

Figures reproduced from arXiv: 2605.30690 by Chongrui Ye, Ge Liu, Haozhen Zhang, Jiaxuan You, Jingjun Xu, Tao Feng, Tianyang Luo, Xueqiang Xu.

**Figure 1.** Overview of ElasticMem. ElasticMem learns to use long-term memory as an elastic latent resource. (1) Latent memory bank construction. Memory chunks from dialogues, passages, and skill cards are encoded once by a frozen offline LLM encoder. Each chunk is stored as a retrieval key and a latent content cache, forming a read-only memory bank B that is not updated during training. (2) Query-conditioned elastic …

**Figure 2.** Ablation studies on Qwen2.5-7B-Instruct. (a) Effect of budget policy: ElasticMem outperforms Random Budget, Uniform Budget, and MLP Budget Policy, showing that the Transformer budget policy better learns how to allocate latent capacity across retrieved memories. (b) Effect of retrieval design: ElasticMem outperforms Semantic Retrieval, Frozen-State Retrieval, and QueryState Retrieval, demonstrating the be…

Original abstract

Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab-uiuc/ElasticMem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ElasticMem learns a policy for adaptive hidden-state retrieval and variable latent memory budgets via GRPO, with reported gains on QA and ALFWorld but thin experimental detail and no transfer tests.

The main takeaway is that this paper treats memory allocation itself as a learnable component rather than a fixed hyperparameter. It retrieves from a latent bank using the reasoner's hidden state, assigns each item a variable budget through a policy, and optimizes the retrieval-plus-budget decisions directly on downstream rewards with group-relative policy optimization.

What is new is the joint handling of adaptive retrieval and elastic budget in one trainable loop, plus the use of soft memory tokens injected into generation. The abstract reports clear lifts over the strongest baselines on the 3B and 7B Qwen2.5 backbones: 26.2% and 24.6% in weighted average QA accuracy, 66.3% and 27.2% in ALFWorld success rate, and the lowest ALFWorld token cost among compared methods. Ablations are said to show that the adaptive components beat rigid cosine-similarity retrieval.

The soft spots are the usual ones for an abstract-only view. Baseline implementations are not described, so it is hard to judge how much of the gap comes from the new machinery versus better tuning. No variance numbers or statistical tests appear. More importantly, training and evaluation stay inside MemorySuite and ALFWorld; without cross-task or out-of-distribution runs it is difficult to tell whether the policy discovers transferable utility signals or simply fits task-specific retrieval patterns. This is the same overfitting risk raised by the load-bearing premise.

The work is aimed at people already building memory-augmented agents who want lower token overhead and query-dependent allocation. The idea is concrete enough and the claimed improvements large enough that it should go to referees so the community can examine the code, the exact baselines, and whether the policy actually generalizes.
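The letter's point that the budget is a policy output rather than a hyperparameter can be illustrated with a stochastic budget head. The sketch below assumes a categorical distribution over integer budgets parameterized by logits; `sample_budget`, the logit values, and the budget range are hypothetical, and the paper's actual policy architecture (a Transformer budget policy, per Figure 2) is not modeled. Returning the log-probability alongside the sample is what lets a policy-gradient method such as GRPO credit or penalize the choice.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_budget(logits: List[float], rng: random.Random) -> Tuple[int, float]:
    """Sample a budget in {0, ..., len(logits) - 1} and return it with its log-probability."""
    probs = softmax(logits)
    budget = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return budget, math.log(probs[budget])


# Placeholder logits for one retrieved memory: budgets 0 to 3 latent slots.
rng = random.Random(0)
budget, log_prob = sample_budget([0.1, 1.5, 0.3, -0.7], rng)
print(budget, log_prob)
```

A budget of 0 in this sketch would mean the retrieved item is dropped entirely, which is one way a single policy could express both selection and allocation.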

Referee Report

3 major / 2 minor

Summary. The paper proposes ElasticMem, a framework for LLM agents that builds an offline latent memory bank, performs adaptive retrieval from the reasoner's hidden states, assigns variable latent budgets to retrieved memories via a learned policy, and injects them as soft memory tokens. The entire memory-use process is optimized end-to-end with group-relative policy optimization (GRPO) on downstream task rewards. On MemorySuite QA and ALFWorld, it reports large gains over the strongest baselines (26.2%/24.6% in weighted average QA accuracy and 66.3%/27.2% in ALFWorld success rate for the 3B/7B Qwen2.5 backbones) while recording the lowest ALFWorld token cost; ablations suggest benefits from adaptive retrieval and elastic allocation.

Significance. If the results and generalizability claims hold, ElasticMem would advance memory-augmented agents by demonstrating that memory allocation and retrieval can be learned as an elastic, reward-optimized resource rather than fixed or rigid. The planned code release supports reproducibility. The work addresses a clear mismatch between query-dependent memory utility and fixed allocation in prior text- and latent-space methods.

major comments (3)

[Experiments] Experiments section: the reported gains (e.g., 26.2% QA accuracy, 66.3% ALFWorld success) are presented without details on baseline implementations, number of random seeds, variance, statistical tests, or train/test splits, so the data support for the central performance claims cannot be assessed.
[Evaluation] Evaluation / §4: no cross-task transfer, held-out task, or out-of-distribution experiments are described to test whether the GRPO-optimized memory budget allocation policy learns transferable utility signals rather than task-specific retrieval patterns on MemorySuite/ALFWorld; this directly bears on the claim that memory becomes a generalizable 'learnable resource'.
[Method] Method section, policy description: the free parameters of the memory budget allocation policy are optimized on downstream rewards, but no analysis shows that the resulting policy reduces to quantities independent of the specific training environments, leaving open the possibility that gains arise from environment-specific fitting.

minor comments (2)

[Method] The term 'soft memory tokens' is used without an explicit definition or equation showing how they are constructed from the retrieved latent states.
[Figures] Figure captions and axis labels in the qualitative analysis figures could more clearly indicate which ablations correspond to which curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the strength of our empirical claims and the scope of our generalizability arguments. We address each major comment below and indicate planned revisions where appropriate.

Referee: [Experiments] Experiments section: the reported gains (e.g., 26.2% QA accuracy, 66.3% ALFWorld success) are presented without details on baseline implementations, number of random seeds, variance, statistical tests, or train/test splits, so the data support for the central performance claims cannot be assessed.

Authors: We agree that these details are essential for assessing the results. In the revised manuscript we will expand §4 and the appendix to specify: (i) exact baseline implementations and hyper-parameters, (ii) the number of random seeds (three seeds were used throughout), (iii) per-seed means and standard deviations, (iv) statistical significance tests (paired t-tests against the strongest baseline), and (v) the precise train/test splits used for MemorySuite and ALFWorld. These additions will directly support the reported gains.

revision: yes

Referee: [Evaluation] Evaluation / §4: no cross-task transfer, held-out task, or out-of-distribution experiments are described to test whether the GRPO-optimized memory budget allocation policy learns transferable utility signals rather than task-specific retrieval patterns on MemorySuite/ALFWorld; this directly bears on the claim that memory becomes a generalizable 'learnable resource'.

Authors: We note that the evaluation already spans two qualitatively different domains, memory-intensive QA (MemorySuite) and long-horizon embodied control (ALFWorld), and that the same ElasticMem framework yields substantial gains on both. This provides initial evidence that the learned policy is not narrowly tuned to a single task family. Nevertheless, we acknowledge the absence of explicit held-out or OOD splits within each benchmark. We will add a limitations paragraph discussing this point and the computational cost of additional transfer experiments, while retaining the cross-domain results as supporting evidence for the generalizability claim.

revision: partial

Referee: [Method] Method section, policy description: the free parameters of the memory budget allocation policy are optimized on downstream rewards, but no analysis shows that the resulting policy reduces to quantities independent of the specific training environments, leaving open the possibility that gains arise from environment-specific fitting.

Authors: The policy is trained end-to-end via GRPO on task rewards, and our ablations (Table 3) show that disabling elastic allocation hurts performance on both QA and ALFWorld. While we do not present an explicit decomposition proving environment-independent quantities, the fact that a single learned mechanism improves two dissimilar tasks argues against pure environment-specific fitting. In revision we will add a qualitative analysis of the learned budget distributions across the two domains to further address this concern.

revision: partial
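The paired t-test promised in the first response is simple to state, and with three seeds it has only two degrees of freedom, which is worth keeping in mind when reading any resulting p-value. The sketch below computes the test statistic with the standard library only; the function name and the per-seed scores are hypothetical placeholders, not numbers from the paper.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(method: List[float], baseline: List[float]) -> Tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom.

    Assumes both lists hold scores from the same seeds in the same order and
    that the per-seed differences are not all identical (nonzero variance).
    """
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder per-seed scores for three seeds; purely illustrative values.
t_stat, dof = paired_t_statistic([0.6, 0.7, 0.8], [0.5, 0.5, 0.5])
print(round(t_stat, 3), dof)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom; with so few seeds, reporting the per-seed differences themselves is at least as informative as the test.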

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external task rewards and GRPO optimization

The paper's central mechanism optimizes a memory-use policy via group-relative policy optimization using downstream task rewards on MemorySuite QA and ALFWorld. Reported gains (e.g., 26.2% QA accuracy) are empirical outcomes of this training, not quantities that reduce by the paper's own equations or self-citations to fitted parameters or inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the provided text. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Based on the abstract alone, the central additions rest on a learned policy and standard RL assumptions; the full paper would allow a more precise enumeration.

free parameters (1)

parameters of the memory budget allocation policy
The policy that assigns variable latent budgets is learned from downstream task rewards.

axioms (2)

domain assumption: Hidden states from the LLM reasoner provide effective keys for retrieving relevant latent memories.
Retrieval mechanism is described as operating from the reasoner's hidden state.
domain assumption: Group-relative policy optimization can effectively train the full memory retrieval and injection process.
The abstract states the memory-use process is optimized with this method using task rewards.

invented entities (1)

soft memory tokens (no independent evidence)
purpose: Inject selected latent states into the generation process as variable-budget memory
Introduced as the injection mechanism for the elastic memory.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rosetta Memory: Adaptive Memory for Cross-LLM Agents
cs.LG · 2026-06 · unverdicted · novelty 7.0

Rosetta Memory trains two profile-conditioned operators with a minimum-gain sampling curriculum and performance-gap reward to enable memory transfer between LLMs, showing gains on multi-hop QA benchmarks and robustnes...

Reference graph

Works this paper leans on

83 extracted references · 33 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

2023
[2]

Multi-agent evolve: Llm self-improve through co-evolution.arXiv preprint arXiv:2510.23595, 2025

Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: Llm self-improve through co-evolution.arXiv preprint arXiv:2510.23595, 2025

work page arXiv 2025
[3]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Adapting language models to compress contexts

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, 2023

2023
[5]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Seper: Measure retrieval utility through the lens of semantic perplexity reduction.arXiv preprint arXiv:2503.01478, 2025

Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, and Hui Xiong. Seper: Measure retrieval utility through the lens of semantic perplexity reduction.arXiv preprint arXiv:2503.01478, 2025

work page arXiv 2025
[8]

Memory-t1: Reinforcement learning for temporal reasoning in multi-session agents.arXiv preprint arXiv:2512.20092, 2025

Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, et al. Memory-t1: Reinforcement learning for temporal reasoning in multi-session agents.arXiv preprint arXiv:2512.20092, 2025

work page arXiv 2025
[9]

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory- augmented generation.arXiv preprint arXiv:2510.18866, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Similarity is not all you need: Endowing retrieval augmented generation with multi layered thoughts.arXiv preprint arXiv:2405.19893, 2024

Chunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, et al. Similarity is not all you need: Endowing retrieval augmented generation with multi layered thoughts.arXiv preprint arXiv:2405.19893, 2024

work page arXiv 2024
[12]

Smartrag: Jointly learn rag-related tasks from the environment feedback.arXiv preprint arXiv:2410.18141, 2024

Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. Smartrag: Jointly learn rag-related tasks from the environment feedback.arXiv preprint arXiv:2410.18141, 2024

work page arXiv 2024
[13]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[15]

Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025

2025
[16]

Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024. 10

2024
[17]

Personamem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

work page arXiv 2025
[18]

Adaptation of agentic ai.arXiv preprint arXiv:2512.16301, 2025

Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, et al. Adaptation of agentic ai.arXiv preprint arXiv:2512.16301, 2025

work page arXiv 2025
[19]

Memory os of ai agent

Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972– 25981, 2025

2025
[20]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

2020
[21]

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

LangMem.https://langchain-ai.github.io/langmem/, 2024

LangChain. LangMem.https://langchain-ai.github.io/langmem/, 2024

2024
[23]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

2023
[24]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

2024
[26]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024
[29]

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[31]

On memory construction and retrieval for personalized conversational agents.arXiv preprint arXiv:2502.05589, 2025

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H Vicky Zhao, Lili Qiu, et al. On memory construction and retrieval for personalized conversational agents.arXiv preprint arXiv:2502.05589, 2025

work page arXiv 2025
[32]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023. 11

2023
[33]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023
[34]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

2019
[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Large language models can be easily distracted by irrelevant context

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. InInternational Conference on Machine Learning, pages 31210–31227. PMLR, 2023

2023
[37]

Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

2023
[38]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[39]

Preference-aware memory update for long-term llm agents.arXiv preprint arXiv:2510.09720, 2025

Haoran Sun, Zekun Zhang, and Shaoning Zeng. Preference-aware memory update for long-term llm agents.arXiv preprint arXiv:2510.09720, 2025

work page arXiv 2025
[40]

Beyond a million tokens: Benchmarking and enhancing long-term memory in llms.arXiv preprint arXiv:2510.27246, 2025

Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, and J Ross Mitchell. Beyond a million tokens: Benchmarking and enhancing long-term memory in llms.arXiv preprint arXiv:2510.27246, 2025

work page arXiv 2025
[41]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory. arXiv preprint arXiv:2502.00592, 2025.

[43]

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025.

[44]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[45]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.

[46]

Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, and Xueqi Cheng. Training a utility-based retriever through shared context attribution for retrieval-augmented language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 629–648, 2025.

[47]

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. 2024.

[48]

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828, 2025.

[49]

John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, et al. Swe-bench multimodal: Do ai systems generalize to visual software domains? arXiv preprint arXiv:2410.03859, 2024.

[50]

Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704, 2025.

[51]

Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu. Agent-pro: Learning to evolve via policy-level reflection and optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5348–5375, 2024.

[52]

Yingyi Zhang, Junyi Li, Wenlin Zhang, Penyue Jia, Xianneng Li, Yichao Wang, Derong Xu, Yi Wen, Huifeng Guo, Yong Liu, et al. Evoking user memory: Personalizing llm via recollection-familiarity adaptive retrieval. arXiv preprint arXiv:2603.09250, 2026.

[53]

Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025.

[54]

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

[55]

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. arXiv preprint arXiv:2306.07863, 2023.

[56]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024.


