Recognition: 2 theorem links · Lean Theorem
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Pith reviewed 2026-05-15 05:38 UTC · model grok-4.3
The pith
ReasoningBank lets LLM agents distill generalizable strategies from both successes and failures to improve on new tasks over time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReasoningBank distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences; at test time the agent retrieves relevant memories to shape its next actions and integrates the resulting learnings back into the bank. Memory-aware test-time scaling amplifies the process by allocating additional compute to each task, producing abundant diverse experiences that yield higher-quality memory entries through contrastive synthesis. The resulting memory in turn guides more effective scaling, establishing memory-driven experience scaling as a new dimension that lets agents self-evolve with emergent behaviors.
What carries the argument
ReasoningBank, a memory store of distilled reasoning strategies drawn from both successes and failures, retrieved at test time to inform actions and updated with new learnings, together with memory-aware test-time scaling that generates diverse contrastive experiences to improve memory quality.
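The retrieve, act, judge, distill, and store loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class and function names (`MemoryItem`, `ReasoningBankSketch`, `run_task`) are hypothetical, the title/description/content fields follow the item schema quoted further down this page, and a bag-of-words cosine similarity stands in for whatever retriever the authors actually use. The `act`, `judge`, and `distill` callables are placeholders for LLM calls.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    # Hypothetical container; fields mirror the title/description/content schema.
    title: str
    description: str
    content: str


def _bag(text: str) -> Counter:
    """Lower-cased bag of words; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class ReasoningBankSketch:
    items: list = field(default_factory=list)

    def retrieve(self, task: str, k: int = 1) -> list:
        """Return the k stored items most similar to the task description."""
        query = _bag(task)
        ranked = sorted(
            self.items,
            key=lambda m: _cosine(query, _bag(m.title + " " + m.description)),
            reverse=True,
        )
        return ranked[:k]

    def add(self, new_items: list) -> None:
        """Append newly distilled items; no deduplication or pruning here."""
        self.items.extend(new_items)


def run_task(bank, task, act, judge, distill, k=1):
    """One cycle: retrieve memories, act, self-judge, distill, store."""
    memories = bank.retrieve(task, k)
    trajectory = act(task, memories)          # placeholder for the agent rollout
    success = judge(task, trajectory)         # placeholder for the self-judgment
    bank.add(distill(task, trajectory, success))  # both outcomes feed the bank
    return success
```

The point of the sketch is that failures are distilled and stored alongside successes, so the bank grows after every task regardless of outcome.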
If this is right
- Agents using ReasoningBank outperform those that store raw trajectories or only successful routines on web-browsing and software-engineering benchmarks.
- Allocating extra compute via MaTTS produces richer experience sets that synthesize higher-quality memories and accelerate capability growth.
- Memory-driven experience scaling emerges as a distinct scaling axis that compounds with existing test-time compute scaling.
- Accumulated memories enable agents to avoid repeating past errors and exhibit emergent self-improvement behaviors across sequential tasks.
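As a rough sketch of the scaling idea in the list above, the snippet below spends `k` rollouts on a single task, splits them by self-judged outcome, and forms success/failure pairs that a distillation step could contrast. Everything here is an assumption made for illustration: `scaled_rollouts`, `contrastive_pairs`, and the `rollout` and `judge` callables are hypothetical names, and the paper's actual MaTTS procedure may sample and synthesize differently.

```python
import random


def scaled_rollouts(task, rollout, judge, k, seed=0):
    """Run k rollouts of one task and split them by self-judged outcome."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    successes, failures = [], []
    for _ in range(k):
        trajectory = rollout(task, rng)  # placeholder for an agent rollout
        if judge(task, trajectory):      # placeholder for the self-judgment
            successes.append(trajectory)
        else:
            failures.append(trajectory)
    return successes, failures


def contrastive_pairs(successes, failures):
    """Pair every success with every failure as input for contrastive distillation."""
    return [(s, f) for s in successes for f in failures]
```

With a larger `k`, the number of available pairs grows with the product of successes and failures, which is one way to read the claim that extra compute yields richer contrastive signal.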
Where Pith is reading between the lines
- The same contrastive-memory loop could be applied to domains with long task sequences where forgetting prior constraints is costly, such as multi-step scientific workflows.
- If self-judgment noise is high, the framework may require an external verifier step before memory ingestion to prevent drift.
- Memory retrieval could be extended with explicit uncertainty estimates so the agent knows when to trust stored strategies versus falling back to base reasoning.
Load-bearing premise
An agent's own judgment of whether an outcome counts as success or failure supplies reliable signals that can be turned into strategies that transfer usefully to new tasks.
What would settle it
An experiment in which agents equipped with ReasoningBank show no gain or outright worse performance than raw-trajectory or success-only baselines on a held-out task distribution after several cycles of memory use and update.
Original abstract
With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arise. Our code can be found at https://github.com/google-research/reasoning-bank.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReasoningBank, a memory framework for LLM agents that distills generalizable reasoning strategies from self-judged successful and failed task experiences. Agents retrieve relevant memories at test time to inform interactions and integrate new learnings back into the bank. It further introduces memory-aware test-time scaling (MaTTS) to generate abundant diverse experiences via increased compute, creating a claimed synergy where better memory enables more effective scaling. Evaluations on web-browsing and software-engineering benchmarks report consistent outperformance over baselines storing raw trajectories or only successful routines, with MaTTS amplifying gains, establishing memory-driven experience scaling as a new dimension for agent self-evolution.
Significance. If the results hold under rigorous validation, the work is significant for introducing structured reasoning memory as a scalable mechanism for persistent agent improvement, distinct from raw trajectory storage. The MaTTS synergy and open-sourced code at the provided GitHub link are notable strengths that support reproducibility and further research on memory as a scaling axis.
major comments (2)
- [Experimental Evaluation] The central claim that self-judged success/failure labels produce reliable, transferable reasoning strategies (rather than noisy or biased signals) is load-bearing for the outperformance over raw-trajectory baselines, yet the manuscript reports no direct measurement of judgment accuracy, such as agreement with oracle success labels or human ratings, particularly in partially observable domains like web browsing and software engineering.
- [Results] The results section lacks ablations isolating the contribution of failure experiences (versus successes only) and does not report statistical significance, variance across runs, or full experimental protocol details, weakening support for the consistent benchmark gains and the claimed synergy with MaTTS.
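The judgment-accuracy measurement requested in the first major comment could look roughly like this. It is a sketch under assumptions: `judgment_confusion` is a hypothetical helper, and labels are taken to be booleans where `True` means "success", with oracle labels coming from some verifiable ground truth.

```python
from collections import Counter


def judgment_confusion(self_labels, oracle_labels):
    """Compare self-judged success labels with oracle labels.

    Returns a confusion matrix (as a dict) and the raw agreement rate.
    """
    if len(self_labels) != len(oracle_labels):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(self_labels, oracle_labels))
    matrix = {
        "tp": counts[(True, True)],    # judged success, truly success
        "fp": counts[(True, False)],   # judged success, truly failure
        "fn": counts[(False, True)],   # judged failure, truly success
        "tn": counts[(False, False)],  # judged failure, truly failure
    }
    total = len(self_labels)
    agreement = (matrix["tp"] + matrix["tn"]) / total if total else 0.0
    return matrix, agreement
```

False positives matter most for this framework, since a failed trajectory judged as a success would be distilled into a "success" strategy.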
minor comments (2)
- [Abstract] The abstract's final sentence has a grammatical issue ('enabling agents to self-evolve with emergent behaviors naturally arise') that reduces clarity; rephrase to 'enabling agents to self-evolve via emergent behaviors that naturally arise.'
- [Method] The distillation process in the method description would benefit from explicit pseudocode or example prompts showing how reasoning strategies are extracted from experiences to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of experimental rigor. We address each major point below and will revise the manuscript to incorporate the suggested analyses and details.
Point-by-point responses
- Referee: [Experimental Evaluation] The central claim that self-judged success/failure labels produce reliable, transferable reasoning strategies (rather than noisy or biased signals) is load-bearing for the outperformance over raw-trajectory baselines, yet the manuscript reports no direct measurement of judgment accuracy, such as agreement with oracle success labels or human ratings, particularly in partially observable domains like web browsing and software engineering.
  Authors: We agree that direct measurement of self-judgment reliability would strengthen the central claim. In the revised version we will add a dedicated analysis section that (i) compares agent self-judged success labels against oracle ground-truth labels on the software-engineering tasks where verifiable outcomes exist, reporting agreement rates and confusion matrices, and (ii) presents human ratings on a random sample of web-browsing judgments (approximately 100 instances) to quantify reliability under partial observability. These additions will provide quantitative evidence on the quality of the distilled reasoning strategies.
  Revision: yes
- Referee: [Results] The results section lacks ablations isolating the contribution of failure experiences (versus successes only) and does not report statistical significance, variance across runs, or full experimental protocol details, weakening support for the consistent benchmark gains and the claimed synergy with MaTTS.
  Authors: We acknowledge these gaps in the current presentation. The revised manuscript will include: (1) an explicit ablation comparing ReasoningBank (success + failure) against a success-only variant to isolate the value of failure-derived strategies; (2) mean and standard deviation across at least three independent runs for all main tables, together with paired t-test p-values against the strongest baseline; and (3) an expanded experimental-protocol appendix detailing retrieval hyperparameters, memory-update rules, MaTTS compute budgets, and random seeds. These changes will make the reported gains and the memory-scaling synergy more statistically robust.
  Revision: yes
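The run-level statistics promised in the second response (mean, standard deviation, and a paired t-test against a baseline) can be sketched with the standard library alone. The helper names `summarize` and `paired_t_statistic` are hypothetical, and the function stops at the t statistic and degrees of freedom; turning those into a p-value needs the t distribution's CDF, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for matched runs.

    Scores must be paired (same seed or same task split per position).
    """
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only three runs the test has two degrees of freedom, so even large t values correspond to fairly wide uncertainty; more seeds would tighten the comparison.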
Circularity Check
No circularity: empirical framework with external benchmarks
full rationale
The paper defines ReasoningBank through retrieval and distillation operating on external task outcomes and self-judged experiences, then reports benchmark gains over raw-trajectory baselines. No equations, fitted parameters, or self-citation chains reduce the claimed improvements to inputs by construction. The derivation remains self-contained against the stated web-browsing and software-engineering evaluations.
Axiom & Free-Parameter Ledger
free parameters (1)
- retrieval and distillation hyperparameters
axioms (1)
- domain assumption: Agent self-judgment of task success and failure supplies sufficiently accurate signals for distilling reusable strategies
invented entities (1)
- ReasoningBank: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent’s self-judged successful and failed experiences... LLM-as-a-Judge... memory retrieval... MaTTS
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  extract... memory items... Title... Description... Content... success insights... failure reflection
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
- LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
  LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
- Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
  DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data,...
- MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
  MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
- MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
  MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
- Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
  MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
- Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
  Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.
- SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
  SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
- SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
  SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
- Workspace Optimization: How to Train Your Agent
  Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.
- FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
  FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
- Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
  SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...
- ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
  ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
  LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
- ReflectCAP: Detailed Image Captioning with Reflective Memory
  ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...
- Procedural Knowledge at Scale Improves Reasoning
  Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
- WorkflowGen:an adaptive workflow generation mechanism driven by trajectory experience
  WorkflowGen reuses trajectory experiences via node-level and workflow-level extraction plus three-tier semantic routing to cut token use over 40% and raise success 20% on medium-similarity queries versus real-time pla...
- SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
  SafeHarbor uses hierarchical memory with adversarial rule extraction and entropy-driven self-evolution to achieve over 93% refusal on harmful requests while reaching 63.6% benign utility on GPT-4o.
- Training-Free Test-Time Contrastive Learning for Large Language Models
  TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.
- Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
  LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
- ActionNex: A Virtual Outage Manager for Cloud Computing
  ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.
Reference graph
Works this paper leans on
-
[1]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[2]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
-
[3]
Rulin Shao and Rui Qiao and Varsha Kishore and Niklas Muennighoff and Xi Victoria Lin and Daniela Rus and Bryan Kian Hsiang Low and Sewon Min and Wen-tau Yih and Pang Wei Koh and Luke Zettlemoyer , booktitle =. Reason
-
[4]
Lumer, Elias and Gulati, Anmol and Subbiah, Vamse Kumar and Basavaraju, Pradeep Honaganahalli and Burke, James A , journal =. MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations , url =
-
[5]
Human-inspired Episodic Memory for Infinite Context
Zafeirios Fountas and Martin Benfeghoul and Adnan Oomerjee and Fenia Christopoulou and Gerasimos Lampouras and Haitham Bou Ammar and Jun Wang , booktitle =. Human-inspired Episodic Memory for Infinite Context
-
[6]
John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , bibsource =. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering , url =. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, V...
work page 2024
-
[7]
Autonomous Evaluation and Refinement of Digital Agents , url =
Jiayi Pan and Yichi Zhang and Nicholas Tomlin and Yifei Zhou and Sergey Levine and Alane Suhr , booktitle =. Autonomous Evaluation and Refinement of Digital Agents , url =
-
[8]
Yin, Shaozhe and Guo, Jinyu and Shuang, Kai and Liu, Xia and Ou, Ruize , journal =. Learning Wisdom from Errors: Promoting LLM's Continual Relation Learning through Exploiting Error Cases , url =
-
[9]
Contextual Experience Replay for Self-Improvement of Language Agents , url =
Liu, Yitao and Si, Chenglei and Narasimhan, Karthik R and Yao, Shunyu , booktitle =. Contextual Experience Replay for Self-Improvement of Language Agents , url =. doi:10.18653/v1/2025.acl-long.694 , editor =
-
[10]
Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning , url =
Wang, Haozhe and Xu, Qixin and Liu, Che and Wu, Junhong and Lin, Fangzhen and Chen, Wenhu , journal =. Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning , url =
-
[11]
Lee, Jinhyuk and Chen, Feiyang and Dua, Sahil and Cer, Daniel and Shanbhogue, Madhuri and Naim, Iftekhar and. ArXiv preprint , title =
-
[12]
Ghafarollahi, Alireza and Buehler, Markus J , journal =. SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning , url =
-
[13]
A survey on llm-as-a-judge , url =
Gu, Jiawei and Jiang, Xuhui and Shi, Zhichao and Tan, Hexiang and Zhai, Xuehao and Xu, Chengjin and Li, Wei and Shen, Yinghan and Ma, Shengjie and Liu, Honghao and others , journal =. A survey on llm-as-a-judge , url =
-
[14]
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience , url =
Sun, Zeyi and Liu, Ziyu and Zang, Yuhang and Cao, Yuhang and Dong, Xiaoyi and Wu, Tong and Lin, Dahua and Wang, Jiaqi , journal =. SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience , url =
-
[15]
SWE-Exp: Experience-Driven Software Issue Resolution , url =
Chen, Silin and Lin, Shaoxin and Gu, Xiaodong and Shi, Yuling and Lian, Heng and Yun, Longfei and Chen, Dong and Sun, Weiguo and Cao, Lin and Wang, Qianxiang , journal =. SWE-Exp: Experience-Driven Software Issue Resolution , url =
-
[16]
Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , bibsource =
Izzeddin Gur and Hiroki Furuta and Austin V. Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , bibsource =. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis , url =. The Twelfth International Conference on Learning Representations,
-
[17]
Self-Refine: Iterative Refinement with Self-Feedback , url =
Aman Madaan and Niket Tandon and Prakhar Gupta and Skyler Hallinan and Luyu Gao and Sarah Wiegreffe and Uri Alon and Nouha Dziri and Shrimai Prabhumoye and Yiming Yang and Shashank Gupta and Bodhisattwa Prasad Majumder and Katherine Hermann and Sean Welleck and Amir Yazdanbakhsh and Peter Clark , bibsource =. Self-Refine: Iterative Refinement with Self-Fe...
work page 2023
-
[18]
Ting Chen and Simon Kornblith and Mohammad Norouzi and Geoffrey E. Hinton , bibsource =. A Simple Framework for Contrastive Learning of Visual Representations , url =. Proceedings of the 37th International Conference on Machine Learning,
-
[19]
Narasimhan and Yuan Cao , bibsource =
Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R. Narasimhan and Yuan Cao , bibsource =. ReAct: Synergizing Reasoning and Acting in Language Models , url =. The Eleventh International Conference on Learning Representations,
-
[20]
Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and Sachdeva, Noveen and Dhillon, Inderjit and Blistein, Marcel and Ram, Ori and Zhang, Dan and Rosen, Evan and others , journal =. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , url =
-
[21]
Shen, Junhong and Bai, Hao and Zhang, Lunjun and Zhou, Yifei and Setlur, Amrith and Tong, Shengbang and Caples, Diego and Jiang, Nan and Zhang, Tong and Talwalkar, Ameet and others , journal =. Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction , url =
-
[22]
Agent kb: Leveraging cross-domain experience for agentic problem solving , url =
Tang, Xiangru and Qin, Tianrui and Peng, Tianhao and Zhou, Ziyang and Shao, Daniel and Du, Tingting and Wei, Xinming and Xia, Peng and Wu, Fang and Zhu, He and others , journal =. Agent kb: Leveraging cross-domain experience for agentic problem solving , url =
-
[23]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments , url =
Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu , bibsource =. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real...
work page 2024
-
[24]
Liu, Bang and Li, Xinfeng and Zhang, Jiayi and Wang, Jinlin and He, Tanjin and Hong, Sirui and Liu, Hongzhang and Zhang, Shaokun and Song, Kaitao and Zhu, Kunlun and others , journal =. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems , url =
-
[25]
The Thirteenth International Conference on Learning Representations , title =
Antonis Antoniades and Albert. The Thirteenth International Conference on Learning Representations , title =
-
[26]
Memp: Exploring Agent Procedural Memory , url =
Fang, Runnan and Liang, Yuan and Wang, Xiaobin and Wu, Jialong and Qiao, Shuofei and Xie, Pengjun and Huang, Fei and Chen, Huajun and Zhang, Ningyu , journal =. Memp: Exploring Agent Procedural Memory , url =
-
[27]
Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R
Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R. Narasimhan , bibsource =. SWE-bench: Can Language Models Resolve Real-world Github Issues? , url =. The Twelfth International Conference on Learning Representations,
-
[28]
Zhang, Zeyu and Dai, Quanyu and Bo, Xiaohe and Ma, Chen and Li, Rui and Chen, Xu and Zhu, Jieming and Dong, Zhenhua and Wen, Ji-Rong , title =. 2025 , issue_date =. doi:10.1145/3748302 , journal =
-
[29]
MemoryBank: Enhancing large language models with long-term memory
Wanjun Zhong and Lianghong Guo and Qiqi Gao and He Ye and Yanlin Wang , bibsource =. MemoryBank: Enhancing Large Language Models with Long-Term Memory , url =. Thirty-Eighth. doi:10.1609/AAAI.V38I17.29946 , editor =
-
[30]
Beyond Goldfish Memory: Long-Term Open-Domain Conversation , url =
Xu, Jing and Szlam, Arthur and Weston, Jason , booktitle =. Beyond Goldfish Memory: Long-Term Open-Domain Conversation , url =. doi:10.18653/v1/2022.acl-long.356 , editor =
-
[31]
Investigate-consolidate-exploit: A general strategy for inter-task agent self-evolution , url =
Qian, Cheng and Liang, Shihao and Qin, Yujia and Ye, Yining and Cong, Xin and Lin, Yankai and Wu, Yesai and Liu, Zhiyuan and Sun, Maosong , journal =. Investigate-consolidate-exploit: A general strategy for inter-task agent self-evolution , url =
-
[32]
ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning , url =
Xiangru Tang and Tianyu Hu and Muyang Ye and Yanjun Shao and Xunjian Yin and Siru Ouyang and Wangchunshu Zhou and Pan Lu and Zhuosheng Zhang and Yilun Zhao and Arman Cohan and Mark Gerstein , booktitle =. ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning , url =
-
[33]
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents , url =
Zhou, Zijian and Qu, Ao and Wu, Zhaoxuan and Kim, Sunghwan and Prakash, Alok and Rus, Daniela and Zhao, Jinhua and Low, Bryan Kian Hsiang and Liang, Paul Pu , journal =. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents , url =
-
[34]
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent , url =
Yu, Hongli and Chen, Tinghong and Feng, Jiangtao and Chen, Jiangjie and Dai, Weinan and Yu, Qiying and Zhang, Ya-Qin and Ma, Wei-Ying and Liu, Jingjing and Wang, Mingxuan and others , journal =. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent , url =
-
[35]
MemGPT: Towards LLMs as Operating Systems , url =
Packer, Charles and Fang, Vivian and Patil, Shishir\_G and Lin, Kevin and Wooders, Sarah and Gonzalez, Joseph\_E , journal =. MemGPT: Towards LLMs as Operating Systems , url =
-
[36]
Mem0: Building production-ready ai agents with scalable long-term memory , url =
Chhikara, Prateek and Khant, Dev and Aryan, Saket and Singh, Taranjeet and Yadav, Deshraj , journal =. Mem0: Building production-ready ai agents with scalable long-term memory , url =
-
[37]
A-mem: Agentic memory for llm agents , url =
Xu, Wujiang and Liang, Zujie and Mei, Kai and Gao, Hang and Tan, Juntao and Zhang, Yongfeng , journal =. A-mem: Agentic memory for llm agents , url =
-
[38]
Yu Wang and Dmitry Krotov and Yuanzhe Hu and Yifan Gao and Wangchunshu Zhou and Julian McAuley and Dan Gutfreund and Rogerio Feris and Zexue He , booktitle =. M+: Extending Memory
-
[39]
Mind2Web: Towards a Generalist Agent for the Web , url =
Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samual Stevens and Boshi Wang and Huan Sun and Yu Su , bibsource =. Mind2Web: Towards a Generalist Agent for the Web , url =. Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, ...
work page 2023
-
[40]
Yuanzhe Hu and Yu Wang and Julian McAuley , booktitle =. Evaluating Memory in
-
[41]
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory , url =
Di Wu and Hongwei Wang and Wenhao Yu and Yuwei Zhang and Kai-Wei Chang and Dong Yu , booktitle =. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory , url =
-
[42]
Evaluating Very Long-Term Conversational Memory of
Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei , booktitle =. Evaluating Very Long-Term Conversational Memory of. doi:10.18653/v1/2024.acl-long.747 , editor =
-
[43]
Tan, Zhen and Yan, Jun and Hsu, I-Hung and Han, Rujun and Wang, Zifeng and Le, Long and Song, Yiwen and Chen, Yanfei and Palangi, Hamid and Lee, George and Iyer, Anand Rajan and Chen, Tianlong and Liu, Huan and Lee, Chen-Yu and Pfister, Tomas , booktitle =. In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents...
-
[44]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in
Yang Yue and Zhiqi Chen and Rui Lu and Andrew Zhao and Zhaokai Wang and Yang Yue and Shiji Song and Gao Huang , booktitle=. Does Reinforcement Learning Really Incentivize Reasoning Capacity in. 2025 , url=
work page 2025
-
[45]
Reinforcement learning: An introduction , volume =
Sutton, Richard S and Barto, Andrew G and others , number =. Reinforcement learning: An introduction , volume =
-
[46]
Scaling Test-time Compute for LLM Agents , url =
Zhu, King and Li, Hanhao and Wu, Siwei and Xing, Tianshun and Ma, Dehua and Tang, Xiangru and Liu, Minghao and Yang, Jian and Liu, Jiaheng and Jiang, Yuchen Eleanor and others , journal =. Scaling Test-time Compute for LLM Agents , url =
-
[47]
Two heads are better than one: Test-time scaling of multi-agent collaborative reasoning , url =
Jin, Can and Peng, Hongwu and Zhang, Qixin and Tang, Yujin and Metaxas, Dimitris N and Che, Tong , journal =. Two heads are better than one: Test-time scaling of multi-agent collaborative reasoning , url =
-
[48]
Xiao Yu and Baolin Peng and Vineeth Vajipey and Hao Cheng and Michel Galley and Jianfeng Gao and Zhou Yu , booktitle =. Ex
-
[49]
Scaling Test-Time Compute Without Verification or
Amrith Setlur and Nived Rajaraman and Sergey Levine and Aviral Kumar , booktitle =. Scaling Test-Time Compute Without Verification or
-
[50]
Wu, Yangzhen and Sun, Zhiqing and Li, Shanda and Welleck, Sean and Yang, Yiming , journal =. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models , url =
-
[51]
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Yinlam Chow and Guy Tennenholtz and Izzeddin Gur and Vincent Zhuang and Bo Dai and Aviral Kumar and Rishabh Agarwal and Sridhar Thiagarajan and Craig Boutilier and Aleksandra Faust. Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models.
-
[52]
Muennighoff, Niklas and Yang, Zitong and Shi, Weijia and Li, Xiang Lisa and Fei-Fei, Li and Hajishirzi, Hannaneh and Zettlemoyer, Luke and Liang, Percy and Candes, Emmanuel and Hashimoto, Tatsunori. s1: Simple test-time scaling. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1025
-
[53]
Z1: Efficient test-time scaling with code
Yu, Zhaojian and Wu, Yinghao and Zhao, Yilun and Cohan, Arman and Zhang, Xiao-Ping. Z1: Efficient test-time scaling with code.
-
[54]
S*: Test time scaling for code generation
Li, Dacheng and Cao, Shiyi and Cao, Chengkun and Li, Xiuyu and Tan, Shangyin and Keutzer, Kurt and Xing, Jiarong and Gonzalez, Joseph E and Stoica, Ion. S*: Test time scaling for code generation.
-
[55]
Charlie Victor Snell and Jaehoon Lee and Kelvin Xu and Aviral Kumar. Scaling
-
[56]
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Longtao Zheng and Rundong Wang and Xinrun Wang and Bo An. Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. The Twelfth International Conference on Learning Representations.
-
[57]
Zora Zhiruo Wang and Jiayuan Mao and Daniel Fried and Graham Neubig. Agent Workflow Memory.
-
[58]
WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks
Miyai, Atsuyuki and Zhao, Zaiying and Egashira, Kazuki and Sato, Atsuki and Sunada, Tatsumi and Onohara, Shota and Yamanishi, Hiromasa and Toyooka, Mashiro and Nishina, Kunato and Maeda, Ryoma and others. WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks.
-
[59]
The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier de Chezelles and Maxime Gasse and Alexandre Lacoste and Massimo Caccia and Alexandre Drouin and L. The BrowserGym Ecosystem for Web Agent Research. Transactions on Machine Learning Research.
-
[60]
Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig. WebArena:. The Twelfth International Conference on Learning Representations.
-
[61]
Ruslan Salakhutdinov. Deep learning. The 20th. doi:10.1145/2623330.2630809
-
[62]
Inducing programmatic skills for agentic tasks
Wang, Zora Zhiruo and Gandhi, Apurva and Neubig, Graham and Fried, Daniel. Inducing programmatic skills for agentic tasks.
-
[63]
A survey on large language model based autonomous agents
Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and others. A survey on large language model based autonomous agents. doi:10.1007/s11704-024-40231-1
-
[64]
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
Cheng. StreamBench: Towards Benchmarking Continuous Improvement of Language Agents. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
-
[65]
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence. Transactions on Machine Learning Research. 2026.
-
[66]
Rap: Retrieval-augmented planning with contextual memory for multimodal llm agents
Kagaya, Tomoyuki and Yuan, Thong Jing and Lou, Yuxuan and Karlekar, Jayashree and Pranata, Sugiri and Kinose, Akira and Oguri, Koki and Wick, Felix and You, Yang. Rap: Retrieval-augmented planning with contextual memory for multimodal llm agents.
-
[67]
MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation
Kong, Yi and Shi, Dianxi and Yang, Guoli and Huang, Chenlin and Li, Xiaopeng and Jin, Songchang and others. MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation.
-
[68]
Andrew Zhao and Daniel Huang and Quentin Xu and Matthieu Lin and Yong. ExpeL:. Thirty-Eighth. doi:10.1609/AAAI.V38I17.29936
-
[69]
In-Context Principle Learning from Mistakes
Tianjun Zhang and Aman Madaan and Luyu Gao and Steven Zheng and Swaroop Mishra and Yiming Yang and Niket Tandon and Uri Alon. In-Context Principle Learning from Mistakes. Forty-first International Conference on Machine Learning.
-
[70]
MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
Zexue He and Yu Wang and Churan Zhi and Yuanzhe Hu and Tzu-Ping Chen and Lang Yin and Ze Chen and Tong Arthur Wu and Siru Ouyang and Zihan Wang and Jiaxin Pei and Julian McAuley and Yejin Choi and Alex Pentland. MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks.
-
[71]
Liu, Yuxuan and Sun, Hongda and Liu, Wei and Luan, Jian and Du, Bo and Yan, Rui. MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions. doi:10.1145/3690624.3709171
-
[72]
Hongjin Su and Ruoxi Sun and Jinsung Yoon and Pengcheng Yin and Tao Yu and Sercan O Arik. Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments.
-
[73]
Dynamic cheatsheet: Test-time learning with adaptive memory
Suzgun, Mirac and Yuksekgonul, Mert and Bianchi, Federico and Jurafsky, Dan and Zou, James. Dynamic cheatsheet: Test-time learning with adaptive memory.
-
[74]
No Need for Explanations: LLMs can implicitly learn from mistakes in-context
Alazraki, Lisa and Mozes, Maximilian and Campos, Jon Ander and Yi-Chern, Tan and Rei, Marek and Bartolo, Max. No Need for Explanations: LLMs can implicitly learn from mistakes in-context. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1686
-
[75]
Yao Fu and Dong. AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
-
[76]
Self-evolving Agents with reflective and memory-augmented abilities
Liang, Xuechen and He, Yangfan and Xia, Yinghui and Song, Xinyuan and Wang, Jianhui and Tao, Meiling and Sun, Li and Yuan, Xinhang and Su, Jiayi and Li, Keqin and others. Self-evolving Agents with reflective and memory-augmented abilities.
-
[77]
PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Zhang, Xinliang Frederick and Beauchamp, Nick and Wang, Lu. PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes.
-
[78]
Hu, Mengkang and Chen, Tianxing and Chen, Qiguang and Mu, Yao and Shao, Wenqi and Luo, Ping. doi:10.18653/v1/2025.acl-long.1575
-
[79]
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models
Li, Zhiyu and Song, Shichao and Wang, Hanyu and Niu, Simin and Chen, Ding and Yang, Jiawei and Xi, Chenyang and Lai, Huayi and Zhao, Jihao and Wang, Yezhaohui and others. MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models.
-
[80]
Memory in the age of ai agents
Hu, Yuyang and Liu, Shichun and Yue, Yanwei and Zhang, Guibin and Liu, Boyang and Zhu, Fangyi and Lin, Jiahang and Guo, Honglin and Dou, Shihan and Xi, Zhiheng and others. Memory in the age of ai agents.