Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

Congchao Cheng; Fan Jiang; Keye Zheng; Mingyu Yang; Xingkang Lu; Yefei Zheng; Yujie Liu

arxiv: 2606.20475 · v1 · pith:GXL6CJBVnew · submitted 2026-06-18 · 💻 cs.LG

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

Mingyu Yang , Keye Zheng , Congchao Cheng , Yujie Liu , Xingkang Lu , Fan Jiang , Yefei Zheng This is my paper

Pith reviewed 2026-06-26 17:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords memory-driven agentstrace distillationevidence accumulationbatch processingagent self-evolutionpost-processingreinforcement learning

0 comments

The pith

Marginal advantage accumulation lets agents gather stable cross-batch evidence for memory operations in distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that batch-style trace distillation produces contradictory feedback on identical memory operations across batches, preventing distinction between stable and accidental effects. It defines two structural conditions called alignability and comparability that make differential signals comparable across batches. MAA then accumulates signed evidence per operation through exponential moving average and merges identities semantically to maintain traceability. As a post-processing step this yields higher performance than batch baselines while cutting token consumption during optimization. A reader would care if the method scales reliable memory building without continuous online retraining.

Core claim

MAA constructs differential signals to make them comparable across batches, accumulates signed evidence per operation via EMA, and ensures cross-batch traceability through semantic identity merging, allowing distinction of stably effective operations from accidental hits in memory-driven agent self-evolution.

What carries the argument

Marginal Advantage Accumulation (MAA), a post-processing architecture that accumulates signed evidence per memory operation across batches.

If this is right

Achieves the best results in 14 out of 16 settings across 4 benchmarks and 4 target models.
Consistently outperforms existing batch-level distillation baselines.
Matches or surpasses online alternatives in most settings.
Reduces optimization-phase token consumption by approximately 75 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If alignability holds across more agent tasks, the same accumulation could apply to non-memory operations such as tool calls.
The method's efficiency gain suggests it could support longer self-evolution loops that would otherwise exhaust token budgets.
Semantic merging robustness could be checked by swapping the embedding model used for identity.

Load-bearing premise

The two structural conditions of alignability and comparability suffice to make cross-batch operation-level evidence accumulation reliable without new biases.

What would settle it

A run in which operations given positive accumulated scores by MAA show no higher success rate than unaccumulated ones when tested on fresh batches.

Figures

Figures reproduced from arXiv: 2606.20475 by Congchao Cheng, Fan Jiang, Keye Zheng, Mingyu Yang, Xingkang Lu, Yefei Zheng, Yujie Liu.

**Figure 2.** Figure 2: MAA method overview. Propose channel generates candidate ops; Score channel constructs [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Learning curves (Qwen3.7-Max, 5 seeds). Solid: MAA; dashed: Reactive; dotted: Frozen [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: EMA evidence trajectories of representative ops (ALFWorld). Blue: stably effective; red: spurious [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

In batch-style trace distillation, the same memory operation may receive contradictory feedback across different batches. Existing methods lack a cross-batch, operation-level evidence accumulation mechanism, making it impossible to distinguish stably effective operations from accidental hits. This paper formalizes the requirement as two structural conditions, alignability and comparability, and proposes Marginal Advantage Accumulation (MAA). MAA constructs differential signals to make them comparable across batches, accumulates signed evidence per operation via EMA, and ensures cross-batch traceability through semantic identity merging. As a post-processing architecture, MAA achieves the best results in 14 out of 16 settings across 4 benchmarks and 4 target models, consistently outperforming existing batch-level distillation baselines and matching or surpassing online alternatives in most settings, while reducing optimization-phase token consumption by approximately 75%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAA gives a practical post-processing fix for cross-batch memory operation feedback in agent distillation, but the abstract-level claims need full methods to judge.

read the letter

The main takeaway is that this paper targets a real pain point in batch-style trace distillation: the same memory operation can get conflicting signals across batches, and existing methods have no way to accumulate reliable evidence at the operation level. They formalize that as two conditions (alignability and comparability) and introduce Marginal Advantage Accumulation using differential signals, per-operation EMA, and semantic identity merging to enable cross-batch tracking.

What the work does well is stay focused on an engineering gap that matters for people running distillation pipelines. The reported outcome—best results in 14 of 16 settings across four benchmarks and four models, plus roughly 75% lower token use in the optimization phase—would be useful if it holds. Treating the method as a post-processing layer rather than a full retraining change also keeps the scope manageable.

The soft spots are mostly around missing substance. The abstract states the performance numbers and the 75% figure but gives no derivation, ablation table, statistical detail, or even basic description of how the two structural conditions are checked in practice. Without those, it is impossible to tell whether the gains come from the new mechanism or from post-hoc choices in the 16 settings. There are also no citations to prior work on similar accumulation or merging ideas, which leaves the novelty claim hard to place.

This is aimed at researchers who already work on memory-driven agents and batch distillation and want a drop-in way to reduce token waste. A reader who needs concrete efficiency improvements in that narrow setting could extract value from the implementation once the details are filled in.

It deserves a serious referee. The problem is well-scoped and the empirical breadth is reasonable on paper, even if the current write-up requires more transparency on controls and reproducibility before it can be assessed properly. I would send it out with a request for the full experimental protocol and any code.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies contradictory feedback for the same memory operation across batches in batch-style trace distillation. It formalizes two structural conditions (alignability and comparability), proposes Marginal Advantage Accumulation (MAA) as a post-processing architecture that constructs differential signals, accumulates signed evidence per operation via EMA, and uses semantic identity merging for traceability. As a post-processing step, MAA is claimed to achieve the best results in 14 out of 16 settings across 4 benchmarks and 4 target models, outperforming batch-level distillation baselines while matching or surpassing online alternatives in most cases and reducing optimization-phase token consumption by approximately 75%.

Significance. If the empirical superiority and 75% token reduction hold under proper controls, MAA would represent a practical advance for memory-driven agent self-evolution by enabling reliable cross-batch operation-level evidence accumulation without the overhead of online methods. The post-processing framing and explicit structural conditions are strengths that could make the approach reproducible if the full experimental details confirm the claims.

major comments (2)

[Abstract] Abstract: the central claim of best results in 14/16 settings across 4 benchmarks and 4 models supplies no statistical detail, number of runs, variance, or ablation on the alignability/comparability conditions, which is load-bearing for evaluating whether the performance is reliable or subject to post-hoc selection.
[Abstract] Abstract: the assertion that alignability and comparability suffice to make cross-batch accumulation reliable without new biases or information loss is presented without derivation, proof, or empirical test of sufficiency; this directly underpins the weakest assumption in the method's validity.

minor comments (1)

[Abstract] The token reduction is stated as 'approximately 75%' with no precise measurement protocol or baseline comparison detail.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment on the abstract below. We will revise the abstract to incorporate the requested details on statistical reporting and theoretical foundations.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of best results in 14/16 settings across 4 benchmarks and 4 models supplies no statistical detail, number of runs, variance, or ablation on the alignability/comparability conditions, which is load-bearing for evaluating whether the performance is reliable or subject to post-hoc selection.

Authors: We agree the abstract should include these details for clarity. The full manuscript reports all results as means over 5 independent runs with standard deviations in Table 2 and Appendix B. Ablations on the alignability and comparability conditions appear in Section 5.4, confirming their impact on performance. We will revise the abstract to state 'averaged over 5 runs with std. dev. in appendix' and reference the ablations. revision: yes
Referee: [Abstract] Abstract: the assertion that alignability and comparability suffice to make cross-batch accumulation reliable without new biases or information loss is presented without derivation, proof, or empirical test of sufficiency; this directly underpins the weakest assumption in the method's validity.

Authors: The abstract summarizes the approach at a high level. The derivation of the conditions, including formal proofs that they ensure reliable accumulation without new biases or information loss, is provided in Section 3.2 and Appendix A. Empirical tests of sufficiency are reported in Section 5.3. We will add a clarifying phrase to the abstract referencing these sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract presents MAA as a post-processing method that formalizes two structural conditions (alignability and comparability) and then constructs differential signals, applies EMA accumulation, and uses semantic identity merging. No equations, fitted parameters, or self-citations are described that would reduce any claimed prediction or result to the inputs by construction. The central claim of empirical superiority is framed as an independent architectural contribution evaluated across benchmarks, with no visible self-definitional loops or load-bearing self-citations in the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The two structural conditions (alignability, comparability) function as domain assumptions required for the accumulation to be valid.

axioms (1)

domain assumption Alignability and comparability are necessary and sufficient structural conditions for reliable cross-batch operation-level evidence accumulation.
The paper states it formalizes the requirement as these two conditions before proposing MAA.

pith-pipeline@v0.9.1-grok · 5677 in / 1393 out tokens · 17580 ms · 2026-06-26T17:43:11.149050+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 4 canonical work pages

[1]

International Conference on Learning Representations (ICLR) , year=

Large Language Models as Optimizers , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2309.03409 , doi=

Pith/arXiv arXiv
[2]

International Conference on Learning Representations (ICLR) , year=

Large Language Models Are Human-Level Prompt Engineers , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2211.01910 , doi=

Pith/arXiv arXiv
[3]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

Automatic Prompt Optimization with ``Gradient Descent'' and Beam Search , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=. arXiv , primaryClass=:2305.03495 , doi=

arXiv 2023
[4]

Nature , year=

Optimizing Generative AI by Backpropagating Language Model Feedback , author=. Nature , year=. doi:10.1038/s41586-025-08661-4 , eprint=

work page doi:10.1038/s41586-025-08661-4
[5]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2406.16218 , doi=

arXiv
[6]

International Conference on Learning Representations (ICLR) , year=

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2310.03714 , doi=

Pith/arXiv arXiv
[7]

International Conference on Learning Representations (ICLR) , year=

Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2309.08532 , doi=

Pith/arXiv arXiv
[8]

International Conference on Machine Learning (ICML) , year=

PromptBreeder: Self-Referential Self-Improvement via Prompt Evolution , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2309.16797 , doi=

Pith/arXiv arXiv
[9]

International Conference on Learning Representations (ICLR) , year=

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2507.19457 , doi=

Pith/arXiv arXiv
[10]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=. arXiv , primaryClass=:2503.16874 , doi=

arXiv
[11]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2303.11366 , doi=

Pith/arXiv arXiv
[12]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Self-Refine: Iterative Refinement with Self-Feedback , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2303.17651 , doi=

Pith/arXiv arXiv
[13]

ExpeL: LLM agents are experiential learners.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632–19642, 2024

ExpeL: LLM Agents Are Experiential Learners , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=. doi:10.1609/aaai.v38i17.29936 , eprint=

work page doi:10.1609/aaai.v38i17.29936
[14]

Transactions on Machine Learning Research (TMLR) , year=

Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Transactions on Machine Learning Research (TMLR) , year=. arXiv , primaryClass=:2305.16291 , doi=

Pith/arXiv arXiv
[15]

International Conference on Machine Learning (ICML) , year=

Agent Workflow Memory , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2409.07429 , doi=

Pith/arXiv arXiv
[16]

arXiv preprint , year=

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills , author=. arXiv preprint , year=. arXiv , primaryClass=:2603.25158 , doi=

Pith/arXiv arXiv
[17]

Advances in Neural Information Processing Systems (NeurIPS) , year=

A-MEM: Agentic Memory for LLM Agents , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2502.12110 , doi=

Pith/arXiv arXiv
[18]

arXiv preprint , year=

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory , author=. arXiv preprint , year=. arXiv , primaryClass=:2511.20857 , doi=

Pith/arXiv arXiv
[19]

International Conference on Machine Learning (ICML) , year=

MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2603.09022 , doi=

arXiv
[20]

arXiv preprint , year=

HyMEM: Hybrid Self-evolving Structured Memory for GUI Agents , author=. arXiv preprint , year=. arXiv , primaryClass=:2603.10291 , doi=

arXiv
[21]

arXiv preprint , year=

MemGPT: Towards LLMs as Operating Systems , author=. arXiv preprint , year=. arXiv , primaryClass=:2310.08560 , doi=

Pith/arXiv arXiv
[22]

arXiv preprint , year=

Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback , author=. arXiv preprint , year=. arXiv , primaryClass=:2602.02369 , doi=

arXiv
[23]

International Conference on Machine Learning (ICML) , year=

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2510.16079 , doi=

Pith/arXiv arXiv
[24]

Advances in Neural Information Processing Systems (NeurIPS) , year=

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2508.02085 , doi=

arXiv
[25]

arXiv preprint , year=

AgentEvolver: Towards Efficient Self-Evolving Agent System , author=. arXiv preprint , year=. arXiv , primaryClass=:2511.10395 , doi=

arXiv
[26]

arXiv preprint , year=

ReCreate: Reasoning and Creating Domain Agents Driven by Experience , author=. arXiv preprint , year=. arXiv , primaryClass=:2601.11100 , doi=

Pith/arXiv arXiv
[27]

International Conference on Learning Representations (ICLR) , year=

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2509.25140 , doi=

Pith/arXiv arXiv
[28]

International Conference on Machine Learning (ICML) , year=

DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2602.02919 , doi=

arXiv
[29]

Conference on Language Modeling (COLM) , year=

Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation , author=. Conference on Language Modeling (COLM) , year=. arXiv , primaryClass=:2310.02304 , doi=

arXiv
[30]

arXiv preprint , year=

Proximal Policy Optimization Algorithms , author=. arXiv preprint , year=. arXiv , primaryClass=:1707.06347 , doi=

Pith/arXiv arXiv
[31]

arXiv preprint , year=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. arXiv preprint , year=. arXiv , primaryClass=:2402.03300 , doi=

Pith/arXiv arXiv
[32]

arXiv preprint , year=

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning , author=. arXiv preprint , year=. arXiv , primaryClass=:2504.20073 , doi=

Pith/arXiv arXiv
[33]

arXiv preprint , year=

SkillOpt: Executive Strategy for Self-Evolving Agent Skills , author=. arXiv preprint , year=. arXiv , primaryClass=:2605.23904 , doi=

Pith/arXiv arXiv
[34]

arXiv preprint , year=

SkillGrad: Optimizing Agent Skills Like Gradient Descent , author=. arXiv preprint , year=. arXiv , primaryClass=:2605.27760 , doi=

Pith/arXiv arXiv
[35]

arXiv preprint , year=

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents , author=. arXiv preprint , year=. arXiv , primaryClass=:2602.02474 , doi=

Pith/arXiv arXiv
[36]

International Conference on Learning Representations (ICLR) , year=

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2010.03768 , doi=

Pith/arXiv arXiv 2010
[37]

HotpotQA: A dataset for diverse, explainable multi- hop question answering,

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=. doi:10.18653/v1/D18-1259 , eprint=

work page doi:10.18653/v1/d18-1259 2018
[38]

Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year=

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation , author=. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year=. arXiv , primaryClass=:2406.14991 , doi=

arXiv
[39]

International Conference on Learning Representations (ICLR) , year=

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2410.05080 , doi=

arXiv
[40]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2305.18290 , doi=

Pith/arXiv arXiv
[41]

International Conference on Machine Learning (ICML) , year=

signSGD: Compressed Optimisation for Non-Convex Problems , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:1802.04434 , doi=

Pith/arXiv arXiv
[42]

doi: 10.18653/v1/2024.acl-long.511

Large Language Models Are Not Fair Evaluators , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , year=. doi:10.18653/v1/2024.acl-long.511 , eprint=

work page doi:10.18653/v1/2024.acl-long.511 2024

[1] [1]

International Conference on Learning Representations (ICLR) , year=

Large Language Models as Optimizers , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2309.03409 , doi=

Pith/arXiv arXiv

[2] [2]

International Conference on Learning Representations (ICLR) , year=

Large Language Models Are Human-Level Prompt Engineers , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2211.01910 , doi=

Pith/arXiv arXiv

[3] [3]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

Automatic Prompt Optimization with ``Gradient Descent'' and Beam Search , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=. arXiv , primaryClass=:2305.03495 , doi=

arXiv 2023

[4] [4]

Nature , year=

Optimizing Generative AI by Backpropagating Language Model Feedback , author=. Nature , year=. doi:10.1038/s41586-025-08661-4 , eprint=

work page doi:10.1038/s41586-025-08661-4

[5] [5]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2406.16218 , doi=

arXiv

[6] [6]

International Conference on Learning Representations (ICLR) , year=

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2310.03714 , doi=

Pith/arXiv arXiv

[7] [7]

International Conference on Learning Representations (ICLR) , year=

Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2309.08532 , doi=

Pith/arXiv arXiv

[8] [8]

International Conference on Machine Learning (ICML) , year=

PromptBreeder: Self-Referential Self-Improvement via Prompt Evolution , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2309.16797 , doi=

Pith/arXiv arXiv

[9] [9]

International Conference on Learning Representations (ICLR) , year=

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2507.19457 , doi=

Pith/arXiv arXiv

[10] [10]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=. arXiv , primaryClass=:2503.16874 , doi=

arXiv

[11] [11]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2303.11366 , doi=

Pith/arXiv arXiv

[12] [12]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Self-Refine: Iterative Refinement with Self-Feedback , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2303.17651 , doi=

Pith/arXiv arXiv

[13] [13]

ExpeL: LLM agents are experiential learners.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632–19642, 2024

ExpeL: LLM Agents Are Experiential Learners , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=. doi:10.1609/aaai.v38i17.29936 , eprint=

work page doi:10.1609/aaai.v38i17.29936

[14] [14]

Transactions on Machine Learning Research (TMLR) , year=

Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Transactions on Machine Learning Research (TMLR) , year=. arXiv , primaryClass=:2305.16291 , doi=

Pith/arXiv arXiv

[15] [15]

International Conference on Machine Learning (ICML) , year=

Agent Workflow Memory , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2409.07429 , doi=

Pith/arXiv arXiv

[16] [16]

arXiv preprint , year=

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills , author=. arXiv preprint , year=. arXiv , primaryClass=:2603.25158 , doi=

Pith/arXiv arXiv

[17] [17]

Advances in Neural Information Processing Systems (NeurIPS) , year=

A-MEM: Agentic Memory for LLM Agents , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2502.12110 , doi=

Pith/arXiv arXiv

[18] [18]

arXiv preprint , year=

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory , author=. arXiv preprint , year=. arXiv , primaryClass=:2511.20857 , doi=

Pith/arXiv arXiv

[19] [19]

International Conference on Machine Learning (ICML) , year=

MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2603.09022 , doi=

arXiv

[20] [20]

arXiv preprint , year=

HyMEM: Hybrid Self-evolving Structured Memory for GUI Agents , author=. arXiv preprint , year=. arXiv , primaryClass=:2603.10291 , doi=

arXiv

[21] [21]

arXiv preprint , year=

MemGPT: Towards LLMs as Operating Systems , author=. arXiv preprint , year=. arXiv , primaryClass=:2310.08560 , doi=

Pith/arXiv arXiv

[22] [22]

arXiv preprint , year=

Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback , author=. arXiv preprint , year=. arXiv , primaryClass=:2602.02369 , doi=

arXiv

[23] [23]

International Conference on Machine Learning (ICML) , year=

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2510.16079 , doi=

Pith/arXiv arXiv

[24] [24]

Advances in Neural Information Processing Systems (NeurIPS) , year=

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2508.02085 , doi=

arXiv

[25] [25]

arXiv preprint , year=

AgentEvolver: Towards Efficient Self-Evolving Agent System , author=. arXiv preprint , year=. arXiv , primaryClass=:2511.10395 , doi=

arXiv

[26] [26]

arXiv preprint , year=

ReCreate: Reasoning and Creating Domain Agents Driven by Experience , author=. arXiv preprint , year=. arXiv , primaryClass=:2601.11100 , doi=

Pith/arXiv arXiv

[27] [27]

International Conference on Learning Representations (ICLR) , year=

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2509.25140 , doi=

Pith/arXiv arXiv

[28] [28]

International Conference on Machine Learning (ICML) , year=

DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:2602.02919 , doi=

arXiv

[29] [29]

Conference on Language Modeling (COLM) , year=

Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation , author=. Conference on Language Modeling (COLM) , year=. arXiv , primaryClass=:2310.02304 , doi=

arXiv

[30] [30]

arXiv preprint , year=

Proximal Policy Optimization Algorithms , author=. arXiv preprint , year=. arXiv , primaryClass=:1707.06347 , doi=

Pith/arXiv arXiv

[31] [31]

arXiv preprint , year=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. arXiv preprint , year=. arXiv , primaryClass=:2402.03300 , doi=

Pith/arXiv arXiv

[32] [32]

arXiv preprint , year=

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning , author=. arXiv preprint , year=. arXiv , primaryClass=:2504.20073 , doi=

Pith/arXiv arXiv

[33] [33]

arXiv preprint , year=

SkillOpt: Executive Strategy for Self-Evolving Agent Skills , author=. arXiv preprint , year=. arXiv , primaryClass=:2605.23904 , doi=

Pith/arXiv arXiv

[34] [34]

arXiv preprint , year=

SkillGrad: Optimizing Agent Skills Like Gradient Descent , author=. arXiv preprint , year=. arXiv , primaryClass=:2605.27760 , doi=

Pith/arXiv arXiv

[35] [35]

arXiv preprint , year=

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents , author=. arXiv preprint , year=. arXiv , primaryClass=:2602.02474 , doi=

Pith/arXiv arXiv

[36] [36]

International Conference on Learning Representations (ICLR) , year=

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2010.03768 , doi=

Pith/arXiv arXiv 2010

[37] [37]

HotpotQA: A dataset for diverse, explainable multi- hop question answering,

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=. doi:10.18653/v1/D18-1259 , eprint=

work page doi:10.18653/v1/d18-1259 2018

[38] [38]

Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year=

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation , author=. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year=. arXiv , primaryClass=:2406.14991 , doi=

arXiv

[39] [39]

International Conference on Learning Representations (ICLR) , year=

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery , author=. International Conference on Learning Representations (ICLR) , year=. arXiv , primaryClass=:2410.05080 , doi=

arXiv

[40] [40]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. arXiv , primaryClass=:2305.18290 , doi=

Pith/arXiv arXiv

[41] [41]

International Conference on Machine Learning (ICML) , year=

signSGD: Compressed Optimisation for Non-Convex Problems , author=. International Conference on Machine Learning (ICML) , year=. arXiv , primaryClass=:1802.04434 , doi=

Pith/arXiv arXiv

[42] [42]

doi: 10.18653/v1/2024.acl-long.511

Large Language Models Are Not Fair Evaluators , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , year=. doi:10.18653/v1/2024.acl-long.511 , eprint=

work page doi:10.18653/v1/2024.acl-long.511 2024