
arxiv: 2605.30690 · v1 · pith:4XJ5KD2Dnew · submitted 2026-05-29 · 💻 cs.CL

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

Pith reviewed 2026-06-28 23:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords latent memory · LLM agents · adaptive retrieval · memory augmentation · policy optimization · embodied agents · question answering · token efficiency

The pith

Treating latent memory as a learnable elastic resource lets LLM agents adaptively retrieve and budget memory items to improve task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ElasticMem to fix the mismatch between fixed memory allocation and query-dependent needs in LLM agents. It constructs a latent memory bank and trains a policy that retrieves items from the agent's hidden state while assigning each a variable budget. The selected states are added as soft tokens and the whole retrieval-plus-allocation process is optimized end-to-end with group-relative policy optimization on downstream rewards. Experiments on MemorySuite show large gains in QA accuracy and embodied success rates together with lower token usage. Readers would care because current fixed-memory methods either inflate context costs or fail to reuse experience effectively over long interactions.

Core claim

ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, this produces weighted average QA accuracy gains of 26.2% and 24.6% and ALFWorld success rate gains of 66.3% and 27.2% over the strongest baselines while recording the lowest ALFWorld token cost.
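
The retrieve, budget, and inject steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `retrieve`, `allocate_budgets`, and `build_soft_tokens`, the dictionary layout of the bank, the dot-product scoring, and the softmax-based budget split are all assumptions made for the example. In the paper the budget policy is learned from task rewards, whereas here a fixed rule stands in for it.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(hidden_state, bank, top_k):
    """Rank memory items by key/hidden-state dot product and keep the top_k."""
    ranked = sorted(bank, key=lambda m: dot(hidden_state, m["key"]), reverse=True)
    return ranked[:top_k]


def allocate_budgets(hidden_state, memories, total_budget):
    """Toy stand-in for the learned budget policy (assumption, not the paper's).

    Splits total_budget latent slots in proportion to a softmax over scores,
    rounding down and capping each budget at the size of the item's cache.
    """
    weights = softmax([dot(hidden_state, m["key"]) for m in memories])
    return [
        min(len(m["cache"]), int(w * total_budget))
        for w, m in zip(weights, memories)
    ]


def build_soft_tokens(memories, budgets):
    """Concatenate the first `budget` latent vectors of each selected memory."""
    tokens = []
    for m, b in zip(memories, budgets):
        tokens.extend(m["cache"][:b])
    return tokens


# Hypothetical two-dimensional bank: each item has a retrieval key and a content cache.
bank = [
    {"key": [1.0, 0.0], "cache": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]]},
    {"key": [0.0, 1.0], "cache": [[0.5, 0.5], [0.6, 0.6], [0.7, 0.7], [0.8, 0.8]]},
    {"key": [-1.0, 0.0], "cache": [[0.9, 0.9]] * 4},
]
hidden_state = [2.0, 0.0]

selected = retrieve(hidden_state, bank, top_k=2)
budgets = allocate_budgets(hidden_state, selected, total_budget=4)
soft_tokens = build_soft_tokens(selected, budgets)
print(budgets, len(soft_tokens))
```

In this toy run the best-matching item receives three latent slots and the weaker match receives none, which is the kind of uneven, query-dependent allocation the paper calls elastic.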

What carries the argument

The learned policy for adaptive retrieval from hidden states combined with variable latent budget allocation, optimized via group-relative policy optimization on task rewards.
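
For readers unfamiliar with group-relative policy optimization, the sketch below shows the group-relative advantage that gives the method its name: several rollouts are sampled for the same query, and each rollout's reward is normalized against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration. The page does not describe ElasticMem's exact objective (clipping, KL penalty, reward design), so none of that is modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population std with a small eps for numerical safety;
    actual implementations differ on these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four rollouts for one query, with binary task rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat their group average get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline.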

If this is right

  • Adaptive retrieval from hidden states prioritizes useful evidence and transferable plans beyond rigid cosine similarity.
  • Variable latent budget allocation reduces token overhead while raising success rates on memory-intensive QA and control tasks.
  • The same elastic mechanism produces gains on both 3B and 7B backbones, indicating the policy is not tied to a single model scale.
  • Ablations confirm that both the adaptive retrieval component and the elastic budget component contribute to the observed performance lift.
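
To make the contrast with rigid cosine similarity in the first bullet concrete, the toy example below ranks the same two memory keys twice: once by plain cosine similarity against the raw hidden state, and once after passing the hidden state through a projection. The matrix `W` is a hand-picked, hypothetical stand-in for a learned query projection; in a trained system its values would come from optimization on task rewards, and nothing here reflects the paper's actual architecture.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def project(W, x):
    """Apply a linear map W (list of rows) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank(query, keys, score):
    """Return key indices sorted from best to worst score."""
    return sorted(range(len(keys)), key=lambda i: score(query, keys[i]), reverse=True)


keys = [[1.0, 0.0], [0.0, 1.0]]
hidden_state = [0.9, 0.4]

# Hypothetical "learned" projection; here it simply swaps the two features.
W = [[0.0, 1.0], [1.0, 0.0]]

cosine_order = rank(hidden_state, keys, cosine)
learned_order = rank(project(W, hidden_state), keys, cosine)
print(cosine_order, learned_order)
```

The two rankings disagree, which illustrates the general point: once the query side is trainable, retrieval order can follow downstream utility rather than surface similarity alone.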

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the policy truly generalizes, agents could maintain coherent state over interaction lengths that exceed current context windows without proportional cost increases.
  • The same reward-driven elasticity could be applied to other internal resources such as KV-cache eviction or attention span.
  • Success on ALFWorld suggests the approach may transfer to other embodied or multi-step planning domains that require reuse of past plans.

Load-bearing premise

Optimizing memory retrieval and budget allocation with downstream task rewards produces a policy that generalizes across tasks rather than fitting only the training distribution.

What would settle it

Training the policy on MemorySuite tasks and then testing it on a new memory-intensive task outside that suite, such as a different long-horizon embodied environment, and observing no accuracy or success improvement would falsify the claim of a generalizable elastic memory policy.

Figures

Figures reproduced from arXiv: 2605.30690 by Chongrui Ye, Ge Liu, Haozhen Zhang, Jiaxuan You, Jingjun Xu, Tao Feng, Tianyang Luo, Xueqiang Xu.

Figure 1: Overview of ElasticMem. ElasticMem learns to use long-term memory as an elastic latent resource. (1) Latent memory bank construction. Memory chunks from dialogues, passages, and skill cards are encoded once by a frozen offline LLM encoder. Each chunk is stored as a retrieval key and a latent content cache, forming a read-only memory bank B that is not updated during training. (2) Query-conditioned elastic …
Figure 2: Ablation studies on Qwen2.5-7B-Instruct. (a) Effect of budget policy: ElasticMem outperforms Random Budget, Uniform Budget, and MLP Budget Policy, showing that the Transformer budget policy better learns how to allocate latent capacity across retrieved memories. (b) Effect of retrieval design: ElasticMem outperforms Semantic Retrieval, Frozen-State Retrieval, and Query-State Retrieval, demonstrating the be…
Original abstract

Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner's hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at https://github.com/ulab-uiuc/ElasticMem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ElasticMem, a framework for LLM agents that builds an offline latent memory bank, performs adaptive retrieval from the reasoner's hidden states, assigns variable latent budgets to retrieved memories via a learned policy, and injects them as soft memory tokens. The entire memory-use process is optimized end-to-end with group-relative policy optimization (GRPO) on downstream task rewards. On MemorySuite QA and ALFWorld, it reports large gains over the strongest baselines (26.2% and 24.6% in weighted average QA accuracy, and 66.3% and 27.2% in ALFWorld success rate, for the 3B and 7B Qwen2.5 backbones respectively) while recording the lowest ALFWorld token cost; ablations suggest that both adaptive retrieval and elastic allocation contribute.

Significance. If the results and generalizability claims hold, ElasticMem would advance memory-augmented agents by demonstrating that memory allocation and retrieval can be learned as an elastic, reward-optimized resource rather than fixed or rigid. The planned code release supports reproducibility. The work addresses a clear mismatch between query-dependent memory utility and fixed allocation in prior text- and latent-space methods.

major comments (3)
  1. [Experiments] Experiments section: the reported gains (e.g., 26.2% QA accuracy, 66.3% ALFWorld success) are presented without details on baseline implementations, number of random seeds, variance, statistical tests, or train/test splits, so the data support for the central performance claims cannot be assessed.
  2. [Evaluation] Evaluation / §4: no cross-task transfer, held-out task, or out-of-distribution experiments are described to test whether the GRPO-optimized memory budget allocation policy learns transferable utility signals rather than task-specific retrieval patterns on MemorySuite/ALFWorld; this directly bears on the claim that memory becomes a generalizable 'learnable resource'.
  3. [Method] Method section, policy description: the free parameters of the memory budget allocation policy are optimized on downstream rewards, but no analysis shows that the resulting policy reduces to quantities independent of the specific training environments, leaving open the possibility that gains arise from environment-specific fitting.
minor comments (2)
  1. [Method] The term 'soft memory tokens' is used without an explicit definition or equation showing how they are constructed from the retrieved latent states.
  2. [Figures] Figure captions and axis labels in the qualitative analysis figures could more clearly indicate which ablations correspond to which curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the strength of our empirical claims and the scope of our generalizability arguments. We address each major comment below and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported gains (e.g., 26.2% QA accuracy, 66.3% ALFWorld success) are presented without details on baseline implementations, number of random seeds, variance, statistical tests, or train/test splits, so the data support for the central performance claims cannot be assessed.

    Authors: We agree that these details are essential for assessing the results. In the revised manuscript we will expand §4 and the appendix to specify: (i) exact baseline implementations and hyper-parameters, (ii) the number of random seeds (three seeds were used throughout), (iii) per-seed means and standard deviations, (iv) statistical significance tests (paired t-tests against the strongest baseline), and (v) the precise train/test splits used for MemorySuite and ALFWorld. These additions will directly support the reported gains. revision: yes

  2. Referee: [Evaluation] Evaluation / §4: no cross-task transfer, held-out task, or out-of-distribution experiments are described to test whether the GRPO-optimized memory budget allocation policy learns transferable utility signals rather than task-specific retrieval patterns on MemorySuite/ALFWorld; this directly bears on the claim that memory becomes a generalizable 'learnable resource'.

    Authors: We note that the evaluation already spans two qualitatively different domains—memory-intensive QA (MemorySuite) and long-horizon embodied control (ALFWorld)—and that the same ElasticMem framework yields substantial gains on both. This provides initial evidence that the learned policy is not narrowly tuned to a single task family. Nevertheless, we acknowledge the absence of explicit held-out or OOD splits within each benchmark. We will add a limitations paragraph discussing this point and the computational cost of additional transfer experiments, while retaining the cross-domain results as supporting evidence for the generalizability claim. revision: partial

  3. Referee: [Method] Method section, policy description: the free parameters of the memory budget allocation policy are optimized on downstream rewards, but no analysis shows that the resulting policy reduces to quantities independent of the specific training environments, leaving open the possibility that gains arise from environment-specific fitting.

    Authors: The policy is trained end-to-end via GRPO on task rewards, and our ablations (Table 3) show that disabling elastic allocation hurts performance on both QA and ALFWorld. While we do not present an explicit decomposition proving environment-independent quantities, the fact that a single learned mechanism improves two dissimilar tasks argues against pure environment-specific fitting. In revision we will add a qualitative analysis of the learned budget distributions across the two domains to further address this concern. revision: partial
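
The statistical reporting promised in response 1 (per-seed means, standard deviations, and paired t-tests against the strongest baseline) can be sketched as follows. The scores below are placeholders invented for the example and are NOT results from the paper; the function name is likewise hypothetical. Only the standard library is used, so the sketch returns the t statistic and degrees of freedom rather than a p-value.

```python
import math
import statistics


def paired_t_statistic(ours, baseline):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder per-seed scores for three seeds; NOT taken from the paper.
ours = [0.62, 0.60, 0.64]
baseline = [0.50, 0.51, 0.49]

t_stat, dof = paired_t_statistic(ours, baseline)
print(f"mean={statistics.fmean(ours):.3f} std={statistics.stdev(ours):.3f}")
print(f"t={t_stat:.3f} dof={dof}")
```

With three seeds there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30. A paired test at this sample size has little power, which is worth keeping in mind when reading the revised tables.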

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external task rewards and GRPO optimization

full rationale

The paper's central mechanism optimizes a memory-use policy via group-relative policy optimization using downstream task rewards on MemorySuite QA and ALFWorld. Reported gains (e.g., 26.2% QA accuracy) are empirical outcomes of this training, not quantities that reduce by the paper's own equations or self-citations to fitted parameters or inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the provided text. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Based on the abstract's description, the central additions rest on a learned policy and standard RL assumptions; the full paper would allow a more precise enumeration.

free parameters (1)
  • parameters of the memory budget allocation policy
    The policy that assigns variable latent budgets is learned from downstream task rewards.
axioms (2)
  • [domain assumption] Hidden states from the LLM reasoner provide effective keys for retrieving relevant latent memories
    Retrieval mechanism is described as operating from the reasoner's hidden state.
  • [domain assumption] Group-relative policy optimization can effectively train the full memory retrieval and injection process
    The abstract states the memory-use process is optimized with this method using task rewards.
invented entities (1)
  • soft memory tokens [no independent evidence]
    purpose: Inject selected latent states into the generation process as variable-budget memory
    Introduced as the injection mechanism for the elastic memory.

pith-pipeline@v0.9.1-grok · 5854 in / 1444 out tokens · 39915 ms · 2026-06-28T23:04:42.999850+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rosetta Memory: Adaptive Memory for Cross-LLM Agents

    cs.LG · 2026-06 · unverdicted · novelty 7.0

    Rosetta Memory trains two profile-conditioned operators with a minimum-gain sampling curriculum and performance-gap reward to enable memory transfer between LLMs, showing gains on multi-hop QA benchmarks and robustnes...

Reference graph

Works this paper leans on

83 extracted references · 33 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

  2. [2]

    Multi-agent evolve: Llm self-improve through co-evolution.arXiv preprint arXiv:2510.23595, 2025

    Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: Llm self-improve through co-evolution.arXiv preprint arXiv:2510.23595, 2025

  3. [3]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

  4. [4]

    Adapting language models to compress contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3829–3846, 2023

  5. [5]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  6. [6]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

  7. [7]

    Seper: Measure retrieval utility through the lens of semantic perplexity reduction.arXiv preprint arXiv:2503.01478, 2025

    Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, and Hui Xiong. Seper: Measure retrieval utility through the lens of semantic perplexity reduction.arXiv preprint arXiv:2503.01478, 2025

  8. [8]

    Memory-t1: Reinforcement learning for temporal reasoning in multi-session agents.arXiv preprint arXiv:2512.20092, 2025

    Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, et al. Memory-t1: Reinforcement learning for temporal reasoning in multi-session agents.arXiv preprint arXiv:2512.20092, 2025

  9. [9]

    LightMem: Lightweight and Efficient Memory-Augmented Generation

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory- augmented generation.arXiv preprint arXiv:2510.18866, 2025

  10. [10]

    Memp: Exploring Agent Procedural Memory

    Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

  11. [11]

    Similarity is not all you need: Endowing retrieval augmented generation with multi layered thoughts.arXiv preprint arXiv:2405.19893, 2024

    Chunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, et al. Similarity is not all you need: Endowing retrieval augmented generation with multi layered thoughts.arXiv preprint arXiv:2405.19893, 2024

  12. [12]

    Smartrag: Jointly learn rag-related tasks from the environment feedback.arXiv preprint arXiv:2410.18141, 2024

    Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. Smartrag: Jointly learn rag-related tasks from the environment feedback.arXiv preprint arXiv:2410.18141, 2024

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  15. [15]

    Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model

    Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025

  16. [16]

    Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024

    Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents.Advances in Neural Information Processing Systems, 37:10088–10116, 2024. 10

  17. [17]

    Personamem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

  18. [18]

    Adaptation of agentic ai.arXiv preprint arXiv:2512.16301, 2025

    Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, et al. Adaptation of agentic ai.arXiv preprint arXiv:2512.16301, 2025

  19. [19]

    Memory os of ai agent

    Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972– 25981, 2025

  20. [20]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

  21. [21]

    Training Language Models to Self-Correct via Reinforcement Learning

    Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917, 2024

  22. [22]

    LangMem.https://langchain-ai.github.io/langmem/, 2024

    LangChain. LangMem.https://langchain-ai.github.io/langmem/, 2024

  23. [23]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

  24. [24]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  25. [25]

    Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  26. [26]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  27. [27]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  28. [28]

    Evaluating very long-term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  29. [29]

    Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

    Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026

  30. [30]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  31. [31]

    On memory construction and retrieval for personalized conversational agents.arXiv preprint arXiv:2502.05589, 2025

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H Vicky Zhao, Lili Qiu, et al. On memory construction and retrieval for personalized conversational agents.arXiv preprint arXiv:2502.05589, 2025

  32. [32]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023. 11

  33. [33]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  34. [34]

    Sentence-bert: Sentence embeddings using siamese bert- networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

  35. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  36. [36]

    Large language models can be easily distracted by irrelevant context

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. InInternational Conference on Machine Learning, pages 31210–31227. PMLR, 2023

  37. [37]

    Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  38. [38]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  39. [39]

    Preference-aware memory update for long-term llm agents.arXiv preprint arXiv:2510.09720, 2025

    Haoran Sun, Zekun Zhang, and Shaoning Zeng. Preference-aware memory update for long-term llm agents.arXiv preprint arXiv:2510.09720, 2025

  40. [40]

    Beyond a million tokens: Benchmarking and enhancing long-term memory in llms.arXiv preprint arXiv:2510.27246, 2025

    Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, and J Ross Mitchell. Beyond a million tokens: Benchmarking and enhancing long-term memory in llms.arXiv preprint arXiv:2510.27246, 2025

  41. [41]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  42. [42]

    M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

    Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending memoryllm with scalable long-term memory.arXiv preprint arXiv:2502.00592, 2025

  43. [43]

    Browsecomp: A simple yet challenging benchmark for browsing agents, 2025

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025

  44. [44]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

  45. [45]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

  46. [46]

    Training a utility-based retriever through shared context attribution for retrieval- augmented language models

    Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, and Xueqi Cheng. Training a utility-based retriever through shared context attribution for retrieval- augmented language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 629–648, 2025

  47. [47]

    Corrective retrieval augmented generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. 2024

  48. [48]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025. 12

  49. [49]

    Swe-bench multimodal: Do ai systems generalize to visual software domains?arXiv preprint arXiv:2410.03859, 2024

    John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, et al. Swe-bench multimodal: Do ai systems generalize to visual software domains?arXiv preprint arXiv:2410.03859, 2024

  50. [50]

    Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025

    Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025

  51. [51]

    Agent-pro: Learning to evolve via policy-level reflection and optimization

    Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu. Agent-pro: Learning to evolve via policy-level reflection and optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5348–5375, 2024

  52. [52]

    Evoking user memory: Personalizing LLM via recollection-familiarity adaptive retrieval

    Yingyi Zhang, Junyi Li, Wenlin Zhang, Penyue Jia, Xianneng Li, Yichao Wang, Derong Xu, Yi Wen, Huifeng Guo, Yong Liu, et al. Evoking user memory: Personalizing llm via recollection-familiarity adaptive retrieval. arXiv preprint arXiv:2603.09250, 2026

  53. [53]

    A survey on the memory mechanism of large language model-based agents

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025

  54. [54]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  55. [55]

    Synapse: Trajectory-as-exemplar prompting with memory for computer control

    Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. arXiv preprint arXiv:2306.07863, 2023

  56. [56]

    MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024

