Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Benjamin Coleman; Chi Wang; Derek Zhiyuan Cheng; Ed H. Chi; Fernando Pereira; Jingrui He; Mengting Ai; Noveen Sachdeva; Shuo Chen; Tianxin Wei

arxiv: 2511.20857 · v2 · pith:RZ7367QXnew · submitted 2025-11-25 · 💻 cs.CL · cs.AI

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

Pith reviewed 2026-05-21 18:00 UTC · model grok-4.3

classification: cs.CL, cs.AI

keywords: LLM agents, self-evolving memory, test-time learning, streaming benchmark, continual improvement, memory management, ReMem, experience reuse

The pith

LLM agents achieve continual improvement on evolving tasks by refining memory through an integrated action-think-memory pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that memory management in LLM agents must support dynamic evolution during ongoing deployment rather than remaining static or passively retrieved from past dialogue. Existing evaluations overlook the need to accumulate and reuse experience across continuous task streams, causing agents to lose contextual insights in real-world settings like interactive assistants. To close this gap the authors convert multiple datasets into sequential task streams that force agents to search, adapt, and update memory after every interaction. They implement and compare over ten memory modules, supply an experience-retrieval baseline called ExpRAG, and introduce ReMem, a pipeline that tightly couples reasoning, actions, and memory refinement. A sympathetic reader would care because such self-evolving memory could let agents learn without external resets or supervision.

Core claim

Evo-Memory is a streaming benchmark that restructures datasets into sequential task streams so that LLMs must retrieve, integrate, and evolve memory after each interaction. Ten diverse multi-turn and single-turn datasets are used to evaluate more than ten representative memory modules. ExpRAG serves as a baseline for retrieving and applying prior experience, while ReMem is proposed as an action-think-memory refine pipeline that integrates reasoning, task actions, and memory updates to produce continual improvement across the streams.
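
The loop this paragraph describes can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the benchmark's actual code: the names `Experience`, `ExperienceMemory`, `run_stream`, and `toy_solve` are hypothetical, retrieval uses plain word overlap in place of a real retriever, and the stand-in solver replaces an LLM call. The success flag stored with each experience is likewise an assumption about what feedback a stream exposes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Experience:
    task: str
    answer: str
    success: bool  # assumed feedback signal; hypothetical


@dataclass
class ExperienceMemory:
    """Toy experience store. Word-overlap retrieval is an assumption."""
    items: List[Experience] = field(default_factory=list)

    def search(self, query: str, k: int = 2) -> List[Experience]:
        query_words = set(query.lower().split())
        scored = []
        for exp in self.items:
            overlap = len(query_words & set(exp.task.lower().split()))
            if overlap > 0:
                scored.append((overlap, exp))
        # Stable sort: ties keep insertion order (oldest first).
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [exp for _, exp in scored[:k]]

    def update(self, experience: Experience) -> None:
        self.items.append(experience)


def run_stream(stream: Sequence[Tuple[str, str]],
               solve: Callable[[str, List[Experience]], str],
               memory: ExperienceMemory) -> List[bool]:
    """Process tasks in order: retrieve, solve, then update memory."""
    outcomes = []
    for task, gold in stream:
        retrieved = memory.search(task)   # only earlier tasks are visible
        answer = solve(task, retrieved)
        success = answer == gold
        memory.update(Experience(task, answer, success))
        outcomes.append(success)
    return outcomes


def toy_solve(task: str, retrieved: List[Experience]) -> str:
    """Stand-in for an LLM call: reuse a past success, else answer directly."""
    for exp in retrieved:
        if exp.success and exp.task == task:
            return exp.answer
    return task.split()[-1].upper()


stream = [("echo alpha", "ALPHA"), ("echo beta", "BETA"), ("echo alpha", "ALPHA")]
memory = ExperienceMemory()
outcomes = run_stream(stream, toy_solve, memory)
print(outcomes, len(memory.items))
```

The point of the sketch is the ordering constraint: memory is searched before the task is attempted and updated only afterwards, so no task can see its own outcome or any later one.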

What carries the argument

ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to drive continual improvement.
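
One possible reading of an action-think-memory refine loop is sketched below. This page does not specify ReMem's control flow, so everything here is an assumption made for illustration: the function `remem_style_episode`, the three operation names, the scripted policy standing in for an LLM, and the choice to model refinement as pruning a memory entry.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (operation, payload)
Policy = Callable[[str, List[str], List[Step]], Step]


def remem_style_episode(task: str, memory: List[str], policy: Policy,
                        max_steps: int = 8) -> Tuple[Optional[str], List[Step]]:
    """Run one task. At each step the policy picks 'think', 'refine', or 'act'.

    Hypothetical sketch: 'refine' prunes one memory entry, and 'act' ends
    the episode and writes the result back to memory.
    """
    trace: List[Step] = []
    for _ in range(max_steps):
        op, payload = policy(task, memory, trace)
        trace.append((op, payload))
        if op == "think":
            continue                      # reasoning only, no side effects
        if op == "refine":
            if payload in memory:
                memory.remove(payload)    # drop an entry judged unhelpful
            continue
        if op == "act":
            memory.append(f"{task} -> {payload}")
            return payload, trace
        raise ValueError(f"unknown operation: {op}")
    return None, trace                    # step budget exhausted


def scripted_policy(task: str, memory: List[str], trace: List[Step]) -> Step:
    """Stand-in for an LLM: think once, prune stale notes, then act."""
    if "think" not in [op for op, _ in trace]:
        return "think", f"recall notes relevant to: {task}"
    stale = [note for note in memory if note.startswith("stale:")]
    if stale:
        return "refine", stale[0]
    return "act", "open drawer"


memory = ["stale: old hint", "find key -> open drawer"]
answer, trace = remem_style_episode("find key", memory, scripted_policy)
print(answer, [op for op, _ in trace])
```

Interleaving the three operations inside one loop, rather than updating memory only between tasks, is what "tightly integrates" is taken to mean in this sketch.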

Load-bearing premise

Reorganizing existing static datasets into sequential task streams creates a faithful test of real-world continuous deployment where agents evolve memory without external supervision or resets.

What would settle it

If agents using the ReMem pipeline show no measurable rise in success rate or efficiency across successive tasks in the Evo-Memory streams relative to static-memory baselines, the claim of continual improvement would be falsified.
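
A simple way to operationalize this check is to compare success rates early and late in a stream. The sketch below is illustrative only: `cumulative_success` and `late_minus_early` are hypothetical helper names, the half-split is one arbitrary choice of window, and the two outcome lists are synthetic placeholders, not results from the paper.

```python
from typing import List, Sequence


def cumulative_success(outcomes: Sequence[int]) -> List[float]:
    """Running success rate after each task (1 = solved, 0 = failed)."""
    total = 0
    curve = []
    for seen, outcome in enumerate(outcomes, start=1):
        total += outcome
        curve.append(total / seen)
    return curve


def late_minus_early(outcomes: Sequence[int]) -> float:
    """Success rate in the second half of the stream minus the first half."""
    if len(outcomes) < 2:
        raise ValueError("need at least two outcomes")
    mid = len(outcomes) // 2
    early = sum(outcomes[:mid]) / mid
    late = sum(outcomes[mid:]) / (len(outcomes) - mid)
    return late - early


# Synthetic placeholder outcomes, made up for illustration.
evolving_memory = [0, 0, 1, 0, 1, 1, 1, 1]
static_memory = [0, 1, 0, 1, 1, 0, 1, 0]

gain_evolving = late_minus_early(evolving_memory)
gain_static = late_minus_early(static_memory)
print(gain_evolving, gain_static)
```

Under this reading, the claim of continual improvement would be in trouble if the gain for the evolving-memory agent were not reliably larger than the gain for the static-memory baseline across streams.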

Original abstract

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Evo-Memory sets up a streaming benchmark for agent memory evolution with ReMem, though the construction from static data raises questions about its realism for unsupervised test-time learning.

This paper's core offering is Evo-Memory, a benchmark that turns datasets into sequential task streams to evaluate how LLM agents evolve their memory over time, plus the ReMem pipeline that weaves together reasoning, task actions, and memory updates for better continual performance. They do solid work by implementing over ten memory modules and testing them across ten datasets that cover both multi-turn interactions and single-turn reasoning. The unified evaluation and the ExpRAG baseline for experience reuse make it easier to compare approaches. Framing memory as something that must be actively refined after each interaction moves beyond passive retrieval in static chats, which is a useful shift for long-horizon agent tasks.

The soft spots come mainly from the benchmark design. Reorganizing static datasets into streams provides neat task boundaries and reuses underlying data patterns, which might let methods succeed by latching onto those artifacts rather than demonstrating robust self-evolution in unpredictable, reset-free settings. The claim of continual improvement with ReMem would be stronger with detailed ablations, error bars, and tests on more varied distributions. Since the abstract lacks the actual numbers, the quantitative backing remains to be checked in the full results.

This is for researchers working on memory in LLM agents and continual adaptation in interactive systems. Readers who need benchmarks for testing experience accumulation would get practical value from the framework and comparisons. It is worth a serious referee because it tackles an underexplored area with a concrete proposal and broad evaluations. I recommend putting it through peer review, focusing any revisions on clarifying the benchmark's fidelity to real deployment and including comprehensive experimental details.

Referee Report

2 major / 3 minor

Summary. The paper introduces Evo-Memory, a streaming benchmark and framework for self-evolving memory in LLM agents. It structures existing datasets into sequential task streams, unifies and evaluates over ten memory modules across 10 multi-turn and single-turn datasets, provides the ExpRAG baseline for experience retrieval, and proposes ReMem, an action-think-memory refine pipeline that integrates reasoning, task actions, and memory updates to achieve continual improvement.

Significance. If the results demonstrate genuine continual improvement under the proposed setup without exploiting dataset artifacts, this work would provide a valuable standardized benchmark and method for advancing stateful LLM agents capable of long-term planning in dynamic environments, filling a gap left by static conversational evaluations.

major comments (2)

§3 (Benchmark Construction): The reorganization of static datasets into sequential task streams supplies clean task boundaries and reuses the same underlying data distribution. This setup risks allowing memory modules to exploit dataset-specific patterns rather than demonstrating genuine unsupervised, reset-free self-evolution, which directly undermines the central claim that ReMem achieves continual improvement across evolving task streams.

Evaluation and Results sections: The abstract provides no quantitative results, error bars, ablation details, or performance trends over task streams for ReMem versus baselines. Without these controls and statistical reporting in the full manuscript, the strength of the continual improvement claim cannot be verified.

minor comments (3)

Abstract: The description of contributions is dense; consider separating the benchmark description, baseline, and ReMem proposal into clearer sentences.

Figures: Any pipeline diagrams for ReMem should include explicit labels for the action, think, and memory refine steps to improve clarity.

Related Work: Verify that recent papers on test-time adaptation and agent memory are cited to contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's insights into potential limitations of our benchmark construction and the need for clearer quantitative reporting. Below we provide point-by-point responses to the major comments.

Referee: [§3 (Benchmark Construction)] The reorganization of static datasets into sequential task streams supplies clean task boundaries and reuses the same underlying data distribution. This setup risks allowing memory modules to exploit dataset-specific patterns rather than demonstrating genuine unsupervised, reset-free self-evolution, which directly undermines the central claim that ReMem achieves continual improvement across evolving task streams.

Authors: We thank the referee for raising this valid concern about potential dataset artifacts. Our benchmark construction deliberately reuses data distributions to isolate the effect of sequential, reset-free evolution, where memory must accumulate and adapt without access to future tasks or resets. To reduce pattern exploitation, streams are formed from diverse multi-turn and single-turn datasets with varied task goals, and ReMem integrates explicit reasoning and action steps rather than relying on superficial correlations. We agree this does not fully eliminate the risk and will add a new subsection in §3 discussing design rationale, limitations of static-to-stream conversion, and planned follow-up experiments with randomized task permutations and cross-dataset generalization tests.

revision: partial

Referee: [Evaluation and Results sections] The abstract provides no quantitative results, error bars, ablation details, or performance trends over task streams for ReMem versus baselines. Without these controls and statistical reporting in the full manuscript, the strength of the continual improvement claim cannot be verified.

Authors: We apologize for the abstract's brevity, which omitted key numbers to stay within length limits. The full manuscript's Evaluation and Results sections already report performance trends across task streams, direct comparisons to baselines including ExpRAG, component ablations for ReMem, and error bars with statistical details. To improve accessibility, we will revise the abstract to highlight main quantitative outcomes (e.g., average gains of ReMem over baselines) while explicitly referencing the presence of error bars, ablations, and stream-wise trends in the main text.

revision: yes
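
The controls raised in this exchange (randomized task permutations, reported with error bars) could look roughly like the sketch below. It is an illustration under assumptions rather than the authors' protocol: `permuted_streams`, `summarize`, and `toy_evaluate` are hypothetical names, the scorer is a stand-in for a full agent run, and the task identifiers are invented.

```python
import random
import statistics
from typing import Iterator, List, Sequence, Tuple


def permuted_streams(tasks: Sequence[str],
                     seeds: Sequence[int]) -> Iterator[Tuple[int, List[str]]]:
    """Yield one reproducible random task order per seed."""
    for seed in seeds:
        rng = random.Random(seed)   # local generator keeps runs independent
        order = list(tasks)
        rng.shuffle(order)
        yield seed, order


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (0.0 when only one run exists)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def toy_evaluate(order: Sequence[str]) -> float:
    """Stand-in scorer: a task counts as solved when the previous task
    belongs to the same family (same first character)."""
    solved = sum(1 for prev, cur in zip(order, order[1:]) if prev[0] == cur[0])
    return solved / len(order)


tasks = ["a1", "a2", "b1", "b2", "c1"]   # invented task identifiers
runs = list(permuted_streams(tasks, seeds=[0, 1, 2]))
scores = [toy_evaluate(order) for _, order in runs]
mean, std = summarize(scores)
print(f"success rate: {mean:.3f} +/- {std:.3f} over {len(scores)} orders")
```

If a method's gains survive shuffled orderings with a small spread, that is some evidence it is not merely exploiting one fixed sequence; a large spread would suggest order sensitivity.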

Circularity Check

0 steps flagged

No circularity: empirical benchmark and method proposal with external dataset support

The paper introduces Evo-Memory as a streaming benchmark by reorganizing existing datasets into sequential task streams and evaluates over ten memory modules plus the proposed ReMem pipeline. No mathematical derivations, first-principles predictions, or fitted parameters are presented that reduce to inputs by construction. Central claims rest on performance measurements against external multi-turn and QA datasets rather than self-referential definitions or self-citation chains. The work is self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that existing datasets can be meaningfully restructured into sequential streams and that standard LLM prompting and retrieval techniques suffice for memory evolution without additional training.

axioms (2)

Domain assumption: Existing multi-turn and QA datasets can be reorganized into sequential task streams that simulate continuous deployment without loss of original task semantics.
Invoked when the paper states it structures datasets into sequential task streams.

Domain assumption: LLM agents can improve performance on later tasks solely through retrieval, integration, and update of memory without parameter updates or external supervision.
Underlying the claim that ReMem achieves continual improvement via the action-think-memory pipeline.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
cs.AI 2026-05 unverdicted novelty 8.0

RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
EXG: Self-Evolving Agents with Experience Graphs
cs.AI 2026-05 unverdicted novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
Test-Time Learning with an Evolving Library
cs.LG 2026-05 unverdicted novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents
cs.LG 2026-05 unverdicted novelty 7.0

EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
cs.AI 2026-05 unverdicted novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
cs.AI 2026-05 unverdicted novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
cs.AI 2026-04 unverdicted novelty 7.0

SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
M$^\star$: Every Task Deserves Its Own Memory Harness
cs.PL 2026-04 unverdicted novelty 7.0

M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
cs.CL 2026-05 unverdicted novelty 6.0

Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFW...
EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
cs.CL 2026-05 unverdicted novelty 6.0

EvoMemBench evaluates 15 memory methods for LLM agents and finds long-context baselines competitive with no single memory approach working consistently across settings.
Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
cs.LG 2026-05 unverdicted novelty 6.0

SeqMem-Eval reveals that high final accuracy in sequential LLM memory tasks often coexists with substantial forgetting and negative transfer, exposing stability-adaptability trade-offs hidden by standard aggregate metrics.
Context Training with Active Information Seeking
cs.CL 2026-05 unverdicted novelty 6.0

Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
cs.AI 2026-05 unverdicted novelty 6.0

Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
cs.CL 2026-04 unverdicted novelty 6.0

TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...
Code as Agent Harness
cs.CL 2026-05 accept novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
Context Training with Active Information Seeking
cs.CL 2026-05 unverdicted novelty 5.0

Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL 2026-05 unverdicted novelty 5.0

MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 5.0

A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
cs.AI 2026-04 unverdicted novelty 5.0

Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production d...
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
cs.AI 2026-04 unverdicted novelty 5.0

Bian Que deploys an agentic system with flexible skills and self-evolution on a major e-commerce search engine, cutting alerts by 75%, reaching 80% root-cause accuracy, and halving resolution time.
From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
cs.SE 2026-04 unverdicted novelty 5.0

Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...
Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
cs.AI 2026-04 unverdicted novelty 5.0

Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA 2026-04 unverdicted novelty 5.0

MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEv...
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA 2026-04 unverdicted novelty 5.0

MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
cs.MA 2026-03 unverdicted novelty 5.0

LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
Improve Large Language Model Systems with User Logs
cs.CL 2026-02 unverdicted novelty 5.0

UNO distills user logs into semi-structured rules and preferences, applies query-and-feedback clustering to handle heterogeneity, quantifies cognitive gaps to filter noise, and builds primary and reflective modules th...
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
ActionNex: A Virtual Outage Manager for Cloud Computing
cs.AI 2026-04 unverdicted novelty 4.0

ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.
Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 30 Pith papers · 1 internal anchor

[1]

Evaluating Very Long-Term Conversational Memory of LLM Agents

URL https://api.semanticscholar.org/CorpusID:278960153. X. Liang, B. Wang, H. Huang, S. Wu, P. Wu, L. Lu, Z. Ma, and Z. Li. Scm: Enhancing large language model with self-controlled memory framework. 2023. URL https://api.semanticscholar.org/CorpusID:258331553. J. Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index. J. Liu, N. Loo, H. Li, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

1,3” or “2-4

for grounded navigation and compositional reasoning, ScienceWorld (Wang et al., 2022) for open-ended scientific experimentation, Jericho (Hausknecht et al., 2020) for text-based game exploration, and PDDL tasks (Yang et al., 2023) for symbolic planning. Together, these environments emphasize long-horizon reasoning, sequential decision-making, and the use of a...

work page 2022
