Mem-π: Adaptive Memory through Learning When and What to Generate
Pith reviewed 2026-05-21 04:24 UTC · model grok-4.3
The pith
A dedicated model learns to generate concise guidance for LLM agents only when it helps, outperforming retrieval from memory banks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mem-π uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. It is trained with a decision-content decoupled reinforcement learning objective that enables it to abstain when generation would not help and otherwise produce concise, useful guidance.
What carries the argument
A decision-content decoupled reinforcement learning objective applied to a separate language or vision-language model that jointly decides when to generate and what concise guidance to produce for the agent.
If this is right
- Agents achieve over 30 percent relative gains on web navigation benchmarks compared with retrieval baselines.
- The same trained model improves performance on terminal-based tool use and text-based embodied interaction tasks.
- Generation replaces retrieval, removing the need to maintain and query large static memory banks.
- The decision to abstain prevents unnecessary or misaligned guidance that could distract the agent.
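To make the "over 30 percent relative gains" figure concrete, here is a minimal sketch of how a relative gain is computed from two success rates. The function name and both success rates are hypothetical placeholders chosen for illustration; they are not values reported in the paper.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Return the improvement of `new` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical success rates, NOT taken from the paper.
retrieval_success = 0.30
mem_pi_success = 0.40

gain = relative_gain(mem_pi_success, retrieval_success)
print(f"relative gain: {gain:.1%}")  # prints "relative gain: 33.3%"
```

In this made-up case a relative gain of about 33% corresponds to only 10 absolute percentage points, so the baseline rate matters when reading such a claim.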
Where Pith is reading between the lines
- This separation of a guidance generator from the main agent could simplify scaling to longer-horizon tasks by keeping the primary policy focused.
- The RL objective might be adapted to other agent settings where intermediate natural-language plans are more valuable than raw retrieval.
- If the abstention policy generalizes, future agents could operate with smaller context windows by generating only the needed summary on the fly.
Load-bearing premise
A separate model can be trained to reliably choose when to abstain from generating guidance and to produce useful context-specific content otherwise.
What would settle it
An experiment in which the dedicated model is forced to generate guidance on every step of a web navigation or tool-use task and agent success rate drops below the retrieval baseline.
Original abstract
We present Mem-$\pi$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$\pi$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$\pi$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mem-π, a framework for adaptive memory in LLM agents. Rather than retrieving static entries from episodic memory banks, it employs a dedicated language or vision-language model (with separate parameters) that, conditioned on the current agent context, jointly decides when to generate guidance and what concise, context-specific guidance to produce. The model is trained with a decision-content decoupled reinforcement learning objective that enables abstention when generation would not help. Empirical results across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks show consistent outperformance over retrieval-based and prior RL-optimized memory baselines, including over 30% relative improvement on web navigation tasks.
Significance. If the decoupled RL objective reliably trains non-trivial abstention behavior and the reported gains are robust to controls for parameter count and training differences, the work could meaningfully advance memory-augmented agents by replacing static retrieval with on-demand, context-aligned generation. The separation of decision and content heads, together with the empirical breadth across agentic benchmarks, is a clear strength. The result would be more impactful if accompanied by direct evidence that the abstention policy is learned rather than collapsed.
major comments (2)
- [§3.2] §3.2 (decision-content decoupled RL objective): the reward formulation and training procedure for the decision head are not specified in sufficient detail to confirm that the policy learns to abstain precisely when generation would not help, rather than defaulting to an always-generate or never-generate policy; without this, the attribution of gains to adaptive on-demand generation is not yet load-bearing.
- [§5.1] §5.1 and Table 2 (web navigation results): the 30% relative improvement is reported without ablations that isolate the contribution of the learned abstention mechanism from the effects of extra parameters or the content-generation training alone; this leaves the central claim that adaptive memory (vs. retrieval or prior RL baselines) drives the gains under-supported.
minor comments (2)
- [Abstract] The abstract and §4 could more explicitly name the exact retrieval baselines and prior RL-optimized methods for immediate reproducibility.
- [Figure 1] Figure 1 would benefit from explicit arrows distinguishing the decision head output from the content-generation output.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have revised the manuscript to provide fuller specification of the RL objective and to include new ablations that isolate the contribution of the learned abstention policy. Our point-by-point responses follow.
- Referee: [§3.2] §3.2 (decision-content decoupled RL objective): the reward formulation and training procedure for the decision head are not specified in sufficient detail to confirm that the policy learns to abstain precisely when generation would not help, rather than defaulting to an always-generate or never-generate policy; without this, the attribution of gains to adaptive on-demand generation is not yet load-bearing.
  Authors: We agree that the original description was insufficiently detailed. In the revised manuscript we have expanded §3.2 with the exact reward for the decision head (r_dec = +1 for correct abstention when downstream performance does not improve, r_dec = -0.5 for unnecessary generation, and 0 otherwise) and the decoupled training procedure (separate binary policy-gradient updates on the decision head using REINFORCE with a learned baseline, while the content head receives task-success rewards only on generations that occur). We have also added training curves and per-task abstention-rate statistics in the appendix demonstrating that the policy converges to non-trivial abstention (approximately 35-45% of steps on web-navigation tasks) rather than either degenerate extreme.
  Revision: yes
- Referee: [§5.1] §5.1 and Table 2 (web navigation results): the 30% relative improvement is reported without ablations that isolate the contribution of the learned abstention mechanism from the effects of extra parameters or the content-generation training alone; this leaves the central claim that adaptive memory (vs. retrieval or prior RL baselines) drives the gains under-supported.
  Authors: We accept the criticism and have added the requested controls. The revised §5.1 now reports three conditions on the web-navigation suite: (i) full Mem-π, (ii) an always-generate ablation that removes the decision head while keeping identical content-generation capacity and parameter count, and (iii) a retrieval baseline whose memory encoder is sized to match Mem-π’s total parameters. The learned abstention policy contributes an additional 14% relative gain over the always-generate variant; the full 30% improvement over retrieval persists after parameter matching. These results appear in an updated Table 2 and are discussed in the main text.
  Revision: yes
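The decision-head reward quoted in the first response can be sketched as a small function, together with a toy REINFORCE loop on a single Bernoulli logit. This is an illustrative sketch only: the function names, the two-argument interface, the running-mean baseline (a stand-in for the learned baseline mentioned in the response), and all loop hyperparameters are assumptions, not the authors' implementation.

```python
import math
import random


def decision_reward(abstained: bool, guidance_helps: bool) -> float:
    """Reward values as quoted in the rebuttal text; the interface is hypothetical."""
    if not guidance_helps:
        # +1 for correct abstention, -0.5 for unnecessary generation.
        return 1.0 if abstained else -0.5
    return 0.0  # "0 otherwise"


def reinforce_unhelpful_context(steps: int = 500, lr: float = 0.5, seed: int = 0) -> float:
    """Toy REINFORCE on one logit for a context where guidance never helps.

    Returns the final probability of choosing to generate.
    """
    rng = random.Random(seed)
    logit = 0.0     # P(generate) starts at 0.5
    baseline = 0.0  # running-mean baseline (assumption: stands in for a learned one)
    for _ in range(steps):
        p_generate = 1.0 / (1.0 + math.exp(-logit))
        generate = rng.random() < p_generate
        reward = decision_reward(abstained=not generate, guidance_helps=False)
        # For a Bernoulli policy, d/dlogit log pi(action) = action - p_generate.
        grad_logp = (1.0 if generate else 0.0) - p_generate
        logit += lr * (reward - baseline) * grad_logp
        baseline += 0.1 * (reward - baseline)
    return 1.0 / (1.0 + math.exp(-logit))


p = reinforce_unhelpful_context()
print(f"P(generate) when guidance never helps: {p:.3f}")
```

Under the quoted reward, contexts where guidance helps yield 0 for both actions, so in this toy setup only the unhelpful contexts shape the decision logit.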
Circularity Check
No circularity: empirical RL framework with benchmark comparisons
The paper introduces Mem-π as an empirical agentic memory system trained via decision-content decoupled RL on separate parameters. No equations, derivations, or first-principles predictions are presented that reduce the outperformance claims to quantities defined by the same data or self-citations. Central results rest on relative improvements across web navigation, tool use, and embodied benchmarks rather than any self-referential fit or uniqueness theorem. The work is self-contained as an experimental comparison against retrieval and prior RL baselines.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance.
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Decision-content advantage decomposition... $\Delta = V_{\text{abs}} - V_{\text{gen}}$... $A_j^{d} = +\Delta$ for the abstain rollout
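The quoted fragment of the advantage decomposition can be illustrated for the abstain rollout alone. The sketch below assumes that V_abs is the return of the single abstain rollout and that V_gen is the mean return over the generate rollouts; both readings, the function name, and the example returns are assumptions. The generate-side advantage is left out because the excerpt elides it.

```python
from statistics import mean
from typing import Sequence


def abstain_decision_advantage(
    abstain_return: float, generate_returns: Sequence[float]
) -> float:
    """Return Delta = V_abs - V_gen, used as the abstain rollout's decision advantage."""
    if not generate_returns:
        raise ValueError("need at least one generate rollout")
    v_abs = abstain_return          # assumption: value of the forced-abstain branch
    v_gen = mean(generate_returns)  # assumption: average over generate branches
    return v_abs - v_gen


# Hypothetical returns for one abstain branch and three generate branches.
delta = abstain_decision_advantage(1.0, [0.0, 1.0, 0.0])
print(f"decision advantage for the abstain rollout: {delta:+.3f}")
```

A positive value means abstaining did better than generating on average for this context, so the abstain decision is reinforced.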
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697,
-
[2]
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511,
-
[3]
Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, and Hao Zhou. Flex: Continuous agent evolution via forward learning from experience. arXiv preprint arXiv:2511.06449,
-
[4]
Memory decoder: A pretrained, plug-and-play memory for large language models
Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, and Zhouhan Lin. Memory decoder: A pretrained, plug-and-play memory for large language models. arXiv preprint arXiv:2508.09874,
-
[5]
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372,
-
[6]
Adapting language models to compress contexts
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3829–3846, Singapore, December
-
[7]
doi: 10.18653/v1/2023.emnlp-main.232
Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.232. URL https://aclanthology.org/2023.emnlp-main.232/. De Chezelles, Thibault Le Sellier, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F Xu, Siva Reddy, Quentin Cappart, et al. The browsergym ecosystem for web agent research. arXiv preprint ...
-
[8]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413,
-
[9]
Prompt injection: Parameterization of fixed inputs
Eunbi Choi, Yongrae Jo, Joel Jang, and Minjoon Seo. Prompt injection: Parameterization of fixed inputs. arXiv preprint arXiv:2206.11349,
-
[10]
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941,
-
[11]
Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, and Yehui Tang. Meki: Memory-based expert knowledge injection for efficient llm scaling. arXiv preprint arXiv:2602.03359,
-
[12]
doi: 10.18653/v1/2022.acl-long.203
Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.203. URL https://aclanthology.org/2022.acl-long.203/. Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, et al. Deep researcher with test-time diffusion. arXiv preprint arXiv:2507.16075,
-
[13]
Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. arXiv preprint arXiv:2408.09559,
-
[14]
Memory in the Age of AI Agents
Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents. arXiv preprint arXiv:2512.13564,
-
[15]
Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, et al. Rethinking memory mechanisms of foundation agents in the second half. arXiv preprint arXiv:2602.06052,
-
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,
-
[17]
Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Arnav Kundu, Mehrdad Farajtabar, and Minsik Cho. Memoryllm: Plug-n-play interpretable feed-forward memory for transformers. arXiv preprint arXiv:2602.00398,
-
[18]
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langu...
-
[19]
doi: 10.18653/v1/2024.naacl-long.389
Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.389. URL https://aclanthology.org/2024.naacl-long.389/. Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings ...
-
[20]
doi: 10.18653/v1/2023.emnlp-main.495
Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.495. URL https://aclanthology.org/2023.emnlp-main.495/. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the...
-
[21]
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. arXiv preprint arXiv:2504.21776,
-
[22]
Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, and Qingyao Ai. Beyond experience retrieval: Learning to generate utility-optimized structured experience for frozen llms. arXiv preprint arXiv:2602.02556,
-
[23]
Prompt compression for large language models: A survey
Zongqian Li, Yinhong Liu, Yixuan Su, and Nigel Collier. Prompt compression for large language models: A survey. arXiv preprint arXiv:2410.12388,
-
[24]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110,
-
[25]
Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990, 2025a. Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yiron...
-
[26]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292,
-
[27]
Hadi Nekoei, Aman Jaiswal, Patrice Bechard, Oleh Shliazhko, Orlando Marquez Ayala, Mathieu Reymond, Massimo Caccia, Alexandre Drouin, Sarath Chandar, and Alexandre Lacoste. Just-in-time episodic feedback hinter: Leveraging offline knowledge to improve llm agents adaptation. arXiv preprint arXiv:2510.04373,
-
[28]
Accessed: 2025-04-06. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems,
-
[29]
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140,
-
[30]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560,
-
[31]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249,
-
[32]
Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning, 2025
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337,
-
[33]
arXiv preprint arXiv:2409.05591 (2024)
Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, and Zhicheng Dou. Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery. arXiv preprint arXiv:2409.05591,
-
[34]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326,
-
[35]
Agent laboratory: Using LLM agents as research assistants
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 5...
-
[36]
Agent laboratory: Using LLM agents as research assistants
Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.findings-emnlp.320. URL https://aclanthology.org/2025.findings-emnlp.320/. ServiceNow. Vancouver release notes. https://docs.servicenow.com/bundle/vancouver-release-notes/,
-
[37]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Accessed: 2026-05-04. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[38]
Evolving programmatic skill networks
Haochen Shi, Xingdi Yuan, and Bang Liu. Evolving programmatic skill networks. arXiv preprint arXiv:2601.03509,
-
[39]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10740–10749, 2020a. Mohit Shridhar, Xingdi Yuan, Marc-Alexand...
-
[40]
Cognitive Architectures for Language Agents
Theodore R Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L Griffiths. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427,
-
[41]
A survey on self-evolution of large language models
Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387,
-
[42]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,
-
[43]
Xing W, Guangyuan Ma, Wanhui Qian, Zijia Lin, and Songlin Hu
Apache-2.0 licensed software. Xing W, Guangyuan Ma, Wanhui Qian, Zijia Lin, and Songlin Hu. Query-as-context pre-training for dense passage retrieval. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1906–1916, Singapore, December
-
[44]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.118. URL https://aclanthology.org/2023.emnlp-main.118/. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291,
-
[45]
Oscar: Operating system control via state-aware reasoning and re-planning
Xiaoqiang Wang and Bang Liu. Oscar: Operating system control via state-aware reasoning and re-planning. In International Conference on Learning Representations, volume 2025, pp. 71417–71439,
-
[46]
R3Mem: Bridging memory retention and retrieval via reversible compression
Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. R3Mem: Bridging memory retention and retrieval via reversible compression. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 4541–4557, Vienna, Austria, July 2025a. Association for Computational...
-
[47]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In International Conference on Machine Learning, pp. 63897–63911. PMLR, 2025c. Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, and Zhouhan Lin. Mlp memory: A retriever-pretrained memory for large language models. arXiv preprint arXiv:2508.01832, 2025...
-
[48]
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6/. Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:...
-
[49]
Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents
Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, and Hongsheng Li. Ui-mem: Self-evolving experience memory for online reinforcement learning in mobile gui agents. arXiv preprint arXiv:2602.05832,
-
[50]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110,
-
[51]
Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents
Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, and Huawei Shen. Chain-of-memory: Lightweight memory construction with dynamic evolution for llm agents. arXiv preprint arXiv:2601.14287,
-
[52]
Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828,
-
[53]
Chengyuan Yang, Zequn Sun, Wei Wei, and Wei Hu. Beyond static summarization: Proactive memory extraction for llm agents. arXiv preprint arXiv:2601.04463, 2026a. Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, and ChengXiang Zhai. Plugmem: A task-agnostic plugin memory module for llm agents. arXiv preprint ar...
-
[54]
Explicit memory learning with expectation maximization
Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, and Xuanjing Huang. Explicit memory learning with expectation maximization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 16618–16635, Miami, Florida, USA, November
-
[55]
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.927. URL https://aclanthology.org/2024.emnlp-main.927/. Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv pre...
-
[56]
Exgrpo: Learning to reason from experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F Wong, and Yu Cheng. Exgrpo: Learning to reason from experience. arXiv preprint arXiv:2510.02245,
-
[57]
Appagent: Multimodal agents as smartphone users
Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–20, 2025a. Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self-evolving agents...
-
[58]
A Survey on the Memory Mechanism of Large Language Model based Agents
Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501,
-
[59]
Lifelongagentbench: Evaluating llm agents as lifelong learners
Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, and Qianli Ma. Lifelongagentbench: Evaluating llm agents as lifelong learners. arXiv preprint arXiv:2505.11942,
-
[60]
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, et al. Externalization in llm agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224,
-
[61]
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, et al. Memento: Fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153,
-
[62]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854,
-
[63]
evaluates agents in enterprise workflow scenarios built on the ServiceNow cloud platform. The benchmark covers four representative workflow categories: Dashboard & Menu Navigation—locating information across nested menus and dashboards; Enterprise Forms—filling multi-field structured forms with domain-specific validation; List Filter/Sort—applying complex fi...
-
[64]
and the Hugging Face transformers library (Wolf et al., 2020). RL training is built on TRL (von Werra et al., 2020), with rollout generation served by vLLM (Kwon et al., 2023). Mem-π is initialized from Qwen2.5-7B-Instruct (Yang et al.,
-
[65]
of $G=4$ branches: one forced [ABSTAIN] (no generation) and three [GENERATE] branches, each producing a memory of up to $L_{\max}=256$ tokens at sampling temperature 1.0 and top_p 0.95. Optimization uses AdamW with learning rate $1\times10^{-6}$, $\beta_1=0.9$, $\beta_2=0.999$, weight decay 0, batch size 8 tasks per step, and 200 optimization steps. The clip ratio is $\epsilon_{\text{clip}}=0.2$, the KL coefficient is...
-
[66]
Reported numbers are means over three independent seeds
we use the official benchmark verifiers from BrowserGym; for LAB (Zheng et al., 2025), correctness is verified by SQL execution (DB) and OS state checks via the benchmark’s built-in verifiers; for ALFWorld (Shridhar et al., 2020b), success is determined by the environment’s terminal condition checker. Reported numbers are means over three independent seeds. L...
-
[67]
What is the top-1 best-selling product in 2022
by examining one representative task per Venn region. The eight regions partition the test split into qualitatively distinct outcome patterns, summarized below. Region 001 contains Mem-π-only successes, Pattern 1 of the main text where generation reaches what retrieval cannot. Region 101 contains tasks Base and Mem-π solve but RAG breaks, Pattern 2 where abstent...
-
[68]
Apply chmod 400 /report.txt for owner-read-only.” Figure 8. Sample experience entries drawn from the offline bank E used to train Mem-π, one per benchmark. Each entry contains a task query (source_trace_goals in JEF-Hinter (Nekoei et al., 2025)) and the guidance (JEF-Hinter hint) text. For WebArena and WorkArena, the bank additionally stores the initial screensho...
-
[69]
List the top 3 search terms in my store
Long task queries and hints are abridged with ellipses, keeping only the contrastive sub-strings. Region 001: Mem-π wins. Pattern 1 – Generation reaches what retrieval cannot. 15 tasks. Case A1 (Task 8). Top search terms (Magento admin). Task: “List the top 3 search terms in my store.” RAG: ✗ “...locate the ‘Top Search Terms’ table...read the first two rows and re...
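The training hyperparameters quoted in the excerpts under entries [64] and [65] above can be collected into one configuration object. This is a minimal sketch for readability: the class and field names are hypothetical, the KL coefficient is left unset because the excerpt is truncated before its value, and the rollouts-per-step figure assumes one branch group per task, which the excerpt does not state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MemPiTrainConfig:
    """Hyperparameters as quoted in the excerpt; all names are illustrative."""

    num_abstain_branches: int = 1      # one forced [ABSTAIN] branch
    num_generate_branches: int = 3     # three [GENERATE] branches
    max_memory_tokens: int = 256       # L_max
    temperature: float = 1.0
    top_p: float = 0.95
    learning_rate: float = 1e-6        # AdamW
    adam_betas: Tuple[float, float] = (0.9, 0.999)
    weight_decay: float = 0.0
    tasks_per_step: int = 8
    optimization_steps: int = 200
    clip_ratio: float = 0.2
    kl_coefficient: Optional[float] = None  # truncated in the source excerpt

    @property
    def group_size(self) -> int:
        """Total branches per group (G in the excerpt)."""
        return self.num_abstain_branches + self.num_generate_branches

    @property
    def rollouts_per_step(self) -> int:
        """Assumption: exactly one branch group is sampled per task."""
        return self.group_size * self.tasks_per_step


cfg = MemPiTrainConfig()
print(cfg.group_size, cfg.rollouts_per_step)
```

Keeping the abstain and generate branch counts as separate fields makes it easy to check that they sum to the quoted group size of 4.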