Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3
The pith
Agents that co-evolve experience and new assets outperform those using either process in isolation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mem²Evolve integrates Experience Memory, which stores task outcomes and strategies, with Asset Memory, which dynamically generates and retains new tools and specialized agents. By using experience to inform asset creation and new assets to generate novel experiences, the framework achieves co-evolutionary expansion of agent capabilities. This results in higher performance across diverse tasks compared to baselines that evolve through experience alone or asset creation alone.
What carries the argument
The dual-memory architecture consisting of Experience Memory for distilling past interactions and Asset Memory for expanding the set of usable tools and agents, linked through a co-evolutionary process.
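The loop described above can be illustrated with a minimal sketch. Everything named here (`ExperienceMemory`, `AssetMemory`, `co_evolve_step`, the word-overlap retrieval, the `tool_` naming scheme) is a hypothetical stand-in chosen for illustration and is not taken from the paper or its released code. The sketch only shows the direction of information flow: retrieved experience conditions asset creation, and executing with the asset produces a new experience entry.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExperienceMemory:
    """Hypothetical store of distilled lessons from past task attempts."""
    entries: list[str] = field(default_factory=list)

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        # Toy relevance score: number of words shared with the task text.
        words = set(task.lower().split())
        scored = [(len(words & set(e.lower().split())), e) for e in self.entries]
        relevant = [pair for pair in scored if pair[0] > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in relevant[:k]]

    def distill(self, task: str, asset: str, success: bool) -> None:
        # Record the outcome so later asset creation can be guided by it.
        outcome = "succeeded" if success else "failed"
        self.entries.append(f"{task} via {asset} {outcome}")


@dataclass
class AssetMemory:
    """Hypothetical registry mapping asset names to the guidance used to build them."""
    assets: dict[str, list[str]] = field(default_factory=dict)

    def get_or_create(self, task: str, guidance: list[str]) -> str:
        # Reuse an existing asset when possible; otherwise create one from guidance.
        name = "tool_" + "_".join(task.lower().split())
        if name not in self.assets:
            self.assets[name] = list(guidance)
        return name


def co_evolve_step(
    task: str,
    experience: ExperienceMemory,
    assets: AssetMemory,
    execute: Callable[[str, str], bool],
) -> tuple[str, bool]:
    guidance = experience.retrieve(task)          # experience -> asset creation
    asset = assets.get_or_create(task, guidance)
    success = execute(task, asset)                # caller-supplied task runner
    experience.distill(task, asset, success)      # asset use -> new experience
    return asset, success
```

Calling `co_evolve_step` on two related tasks in sequence makes the coupling visible: the asset created for the second task records the experience entry distilled from the first.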
If this is right
- Agents can expand their capabilities beyond any manually predefined static toolset.
- The evolution process becomes more stable by avoiding unguided asset creation.
- Experience from new assets enriches the memory for future guidance.
- Overall performance improves by 11.80% over experience-only evolution and by 6.46% over asset-creation-only evolution across the 8 reported benchmarks.
Where Pith is reading between the lines
- Similar mutual reinforcement between memory types could be applied to other self-improving systems like code generation or planning agents.
- Over very long horizons, the co-evolution might hit diminishing returns, so additional mechanisms could be needed to sustain progress.
- Deploying such agents in open-ended environments could test whether the gains transfer beyond the controlled benchmarks used here.
Load-bearing premise
The two evolutionary processes of experience accumulation and asset creation are intrinsically interdependent in a way that produces stable, unbounded growth rather than interference or diminishing returns.
What would settle it
A controlled test that runs the Mem²Evolve process over many more iterations than reported and checks whether performance gains continue, level off, or reverse.
Original abstract
While large language model-powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the Mem²Evolve, which integrates two core components: Experience Memory and Asset Memory. Specifically, Mem²Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem²Evolve achieves improvement of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mem²Evolve, a self-evolving agent framework that integrates Experience Memory and Asset Memory under a co-evolutionary paradigm of capability expansion and experience distillation. It claims that isolating experience accumulation from dynamic asset creation leads to bounded or unstable growth, while their interdependence enables synergistic improvement. Experiments across 6 task categories and 8 benchmarks report gains of 18.53% over standard LLMs, 11.80% over experience-only evolution, and 6.46% over asset-creation-only evolution.
Significance. If the interdependence produces stable synergistic gains rather than additive or saturating effects, the framework could advance self-evolving agents beyond static toolsets or unguided creation. The availability of code at the provided link is a strength for reproducibility.
Major comments (2)
- [Experimental Evaluation] The central claim that co-evolution between Experience Memory and Asset Memory produces synergistic, stable growth (rather than interference or diminishing returns) is load-bearing for the reported deltas, yet the experimental evaluation provides no ablations that disable cross-guidance while holding total memory size and iteration count fixed, nor multi-iteration trajectories demonstrating continued improvement instead of plateau. The numerical gains could therefore arise from increased complexity rather than the asserted interdependence.
- [Method and Experiments] The abstract and method description report percentage improvements without error bars, statistical significance tests, or details on how experience specifically guides asset creation (and vice versa). This leaves the implementation of the co-evolutionary guidance underspecified and the performance claims difficult to attribute to the proposed mechanism.
Minor comments (1)
- [Abstract] The abstract uses inconsistent formatting for Mem²Evolve (e.g., bold superscripts); ensure uniform notation and rendering across the manuscript.
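The controlled ablation requested in the first major comment can be sketched as a small configuration grid. The class and field names below (`AblationConfig`, `experience_to_asset`, `asset_to_experience`, `memory_capacity`, `iterations`) are assumptions made for illustration; the paper's actual configuration interface is not shown in this review. The point of the sketch is that only the two cross-guidance switches vary, while memory capacity and iteration budget are copied unchanged from the base setting.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical run configuration for a cross-guidance ablation."""
    experience_to_asset: bool   # retrieved experience conditions asset creation
    asset_to_experience: bool   # new assets feed distilled experience back
    memory_capacity: int        # held fixed across all variants
    iterations: int             # held fixed across all variants


def ablation_grid(base: AblationConfig) -> list[AblationConfig]:
    """Return all four on/off combinations of the two guidance directions."""
    return [
        replace(base, experience_to_asset=e2a, asset_to_experience=a2e)
        for e2a, a2e in product([True, False], repeat=2)
    ]


# Example base setting; the numeric values are placeholders, not from the paper.
full = AblationConfig(True, True, memory_capacity=500, iterations=10)
variants = ablation_grid(full)
```

Because the dataclass is frozen and `replace` copies every unspecified field, a variant cannot silently differ from the full setting in anything other than the two switches.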
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the validation of our co-evolutionary claims. We address each major comment below and have made revisions to the manuscript to incorporate additional controls, statistical reporting, and methodological details.
Point-by-point responses
- Referee: [Experimental Evaluation] The central claim that co-evolution between Experience Memory and Asset Memory produces synergistic, stable growth (rather than interference or diminishing returns) is load-bearing for the reported deltas, yet the experimental evaluation provides no ablations that disable cross-guidance while holding total memory size and iteration count fixed, nor multi-iteration trajectories demonstrating continued improvement instead of plateau. The numerical gains could therefore arise from increased complexity rather than the asserted interdependence.
  Authors: We agree that isolating the contribution of cross-guidance is essential. In the revised version, we add controlled ablations that disable experience-to-asset and asset-to-experience guidance while exactly matching total memory capacity and iteration budget; these show consistent drops of 5–9% relative to the full co-evolutionary setting, indicating the gains are not merely from added complexity. We also include multi-iteration performance curves (10 iterations) across the main benchmarks, demonstrating sustained improvement rather than early saturation. These results are reported in a new subsection of the experimental analysis.
  Revision: yes
- Referee: [Method and Experiments] The abstract and method description report percentage improvements without error bars, statistical significance tests, or details on how experience specifically guides asset creation (and vice versa). This leaves the implementation of the co-evolutionary guidance underspecified and the performance claims difficult to attribute to the proposed mechanism.
  Authors: We have updated all tables and figures to include error bars (standard deviation over five independent runs) and paired t-test p-values for every reported comparison. The Method section has been expanded with explicit pseudocode for the two guidance directions, together with concrete prompt templates and examples showing how retrieved experiences condition asset generation and how newly created assets are used to produce distilled experiences. These additions make the co-evolutionary loop fully specified and reproducible from the text and released code.
  Revision: yes
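For readers who want to check a paired comparison of the kind the rebuttal describes, the following is a minimal standard-library sketch of the paired t statistic. The function name `paired_t_statistic` and its inputs are illustrative assumptions; nothing here reproduces the authors' evaluation code or their numbers. It returns the statistic and the degrees of freedom; turning those into a p-value needs a t-distribution CDF, which the standard library does not provide (SciPy's `scipy.stats.ttest_rel` is one option).

```python
import math
import statistics


def paired_t_statistic(scores_a: list[float], scores_b: list[float]) -> tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a_i vs b_i."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    if len(scores_a) < 2:
        raise ValueError("need at least two pairs")

    # Work on per-run differences, since the runs are matched.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")

    standard_error = sd_diff / math.sqrt(len(diffs))
    return mean_diff / standard_error, len(diffs) - 1
```

With five runs per method, as in the rebuttal, the statistic has four degrees of freedom, so a large mean difference is still needed before the comparison becomes convincing.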
Circularity Check
No circularity: procedural framework with empirical results, no derivation or fitted predictions
Full rationale
The paper introduces Mem²Evolve as a procedural integration of Experience Memory and Asset Memory under a co-evolutionary paradigm. Claims of stable interdependence and capability growth are asserted in the abstract and supported solely by benchmark comparisons (18.53%, 11.80%, 6.46% gains). No equations, first-principles derivations, parameter fits, or predictions appear in the provided text. Results are measured outcomes from experiments, not quantities forced by the framework definition itself. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core components. The derivation chain is self-contained as an engineering design whose validity rests on external empirical tests rather than internal reduction.