
arxiv: 2605.18421 · v1 · pith:5P4VYI7Rnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

Pith reviewed 2026-05-20 11:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM agents · agent memory · benchmarking · memory evaluation · long-context baselines · self-evolving agents · knowledge vs execution tasks

The pith

Current memory systems for LLM agents fall short of a general solution, with long-context baselines remaining competitive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to evaluate memory in agents that must handle information across evolving interactions. It divides memory needs into scope, either within one episode or across episodes, and content, either facts or procedures. Testing fifteen memory approaches against long-context baselines reveals that memory adds value mainly when immediate context is too short or tasks grow difficult. No single memory type succeeds across all cases, though retrieval suits knowledge tasks and procedural storage aids execution when past experience matches the structure. This matters because agents require dependable recall to manage ongoing work without starting over each time.

Core claim

The paper shows that existing memory mechanisms do not deliver a reliable edge over extended context windows in agent settings. Systematic tests organized by in-episode versus cross-episode scope and knowledge-oriented versus execution-oriented content demonstrate that memory benefits emerge chiefly under limited context or higher task difficulty, retrieval methods lead in knowledge settings, and procedural or long-term methods help execution tasks only when stored experience fits the task structure.

What carries the argument

EvoMemBench, a benchmark structured along memory scope and memory content axes under a standardized self-evolving protocol for comparing agent memory methods.
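
As a rough illustration of the two-axis organization, the four benchmark cells can be enumerated as the product of scope and content. This is a minimal sketch: the class names `Scope`, `Content`, and `Setting` are hypothetical and are not taken from the EvoMemBench code.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Scope(Enum):
    # Memory scope axis, as described in the abstract.
    IN_EPISODE = "in-episode"
    CROSS_EPISODE = "cross-episode"


class Content(Enum):
    # Memory content axis, as described in the abstract.
    KNOWLEDGE = "knowledge-oriented"
    EXECUTION = "execution-oriented"


@dataclass(frozen=True)
class Setting:
    """One cell of the scope x content grid (hypothetical container)."""
    scope: Scope
    content: Content

    def label(self) -> str:
        return f"{self.scope.value} / {self.content.value}"


def all_settings() -> list[Setting]:
    # Cartesian product of the two axes: 2 scopes x 2 contents = 4 cells.
    return [Setting(s, c) for s, c in product(Scope, Content)]


if __name__ == "__main__":
    for setting in all_settings():
        print(setting.label())
```

Keeping the axes as separate enums makes it straightforward to report results per axis (for example, all cross-episode cells) as well as per cell.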

If this is right

  • Memory systems deliver the most help when the current context window cannot hold all required information.
  • Higher task difficulty increases the practical value of any memory mechanism.
  • Retrieval-based memory remains effective for tasks centered on retrieving and using knowledge.
  • Procedural and long-term memory approaches gain traction for execution tasks when stored experience matches the current task demands.
  • No single memory form produces steady gains across the full range of tested settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent builders could begin with strong long-context models and add memory only for specific short-context or complex scenarios.
  • The two-axis structure points toward hybrid memory designs that select types based on whether a task needs facts or procedures.
  • Standardized benchmarks of this kind could track incremental progress toward memory that works reliably in varied agent environments.
  • Extending the evaluation to include more interactive or longer-horizon agent behaviors would test whether self-evolving memory gains further importance.

Load-bearing premise

The chosen tasks, standardized protocol, and two-axis organization capture the memory challenges that matter for real LLM agent deployments.

What would settle it

A follow-up evaluation where one memory method outperforms long-context baselines on every task type, scope, and difficulty level would challenge the finding that no general solution yet exists.
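
The falsification test described above can be phrased as a simple dominance check. The sketch below assumes a hypothetical score table keyed by setting and then by method name, with higher scores being better; the numbers in the usage example are toy placeholders, not results from the paper.

```python
def dominates_everywhere(
    scores: dict[str, dict[str, float]], method: str, baseline: str
) -> bool:
    """Return True if `method` strictly beats `baseline` in every setting.

    `scores[setting][name]` is assumed to hold one score per method,
    higher being better. An empty table never counts as dominance.
    """
    return bool(scores) and all(
        per_setting[method] > per_setting[baseline]
        for per_setting in scores.values()
    )


if __name__ == "__main__":
    # Toy placeholder values for illustration only.
    toy = {
        "setting_a": {"memory_x": 0.6, "long_context": 0.5},
        "setting_b": {"memory_x": 0.4, "long_context": 0.5},
    }
    print(dominates_everywhere(toy, "memory_x", "long_context"))
```

In this toy table the memory method loses in one setting, so the check fails, which mirrors the pattern the paper reports.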

Figures

Figures reproduced from arXiv: 2605.18421 by Bing Tong, Chen Zhang, Jia Li, Kaichi Yu, Miao Peng, Mo Chi, Yan Zhou, Yuhan Li, Yuyao Wang, Zhongjian Zhang.

Figure 1
Figure 1: Overview of EvoMemBench. Existing memory benchmarks cover only parts of this space. Text-centric benchmarks such as LoCoMo [22], LongMemEval [35], and MemoryAgentBench [17] mainly evaluate knowledge retention, retrieval, or revision in conversational or document-like contexts, but do not test whether memory supports action and tool-based execution. Recent agentic or lifelong memory benchmarks, including Me…
Figure 2
Figure 2: Accuracy and cost on INEP-KNOW. Finding 1: Strong long-context baselines remain highly competitive. As shown in Table 3 and…
Figure 3
Figure 3: Cross-environment transfer results in CROSSEP-TOOL.
Figure 4
Figure 4: Cross-environment transfer results in CROSSEP-EMB. In Embodied AI, procedural long-term memory performs best, with an average rank of 5.60. This result shows that cross-episode execution evolution cannot be addressed by one fixed memory form. Retrieval-augmented memory reuses similar past cases, which fits tool-use tasks where similar API calls and parameter patterns recur. General long-term memory maintai…
Figure 5
Figure 5: Cross-environment transfer results in CROSSEP-EMB for BM25, GraphRAG, Mem0, and MemOS.
Figure 6
Figure 6: Cross-environment transfer results in CROSSEP-EMB for MemoryOS, AWM, AgentKB, and ACE.
Figure 7
Figure 7: Cross-environment transfer results in CROSSEP-TOOL for BM25, GraphRAG, Mem0, and MemOS.
Figure 8
Figure 8: Cross-environment transfer results in CROSSEP-TOOL for MemoryOS, AWM, AgentKB, and ACE. C.2 Additional Efficiency Results for Cross-Episode Settings. We report token usage for cross-episode knowledge evolution and both average steps and token usage for cross-episode execution settings
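
The Figure 4 caption reports an average rank for a memory family. The captions do not say how ranks are computed, so the sketch below is one plausible reading: rank methods within each environment by score (1 is best) and average those ranks across environments. The function name and the tie handling (ties broken by sort order) are assumptions.

```python
def average_ranks(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Mean rank per method across tasks.

    `scores[task][method]` is a score where higher is better. Every method
    is assumed to appear in every task. Ties are broken by sort order,
    which is a simplification.
    """
    totals: dict[str, float] = {}
    for per_method in scores.values():
        # Best score first, so the first method gets rank 1.
        ordered = sorted(per_method, key=lambda m: per_method[m], reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0.0) + rank
    n_tasks = len(scores)
    return {method: total / n_tasks for method, total in totals.items()}


if __name__ == "__main__":
    # Toy placeholder scores, not values from the paper.
    toy = {
        "env_1": {"a": 0.9, "b": 0.5, "c": 0.1},
        "env_2": {"a": 0.2, "b": 0.8, "c": 0.1},
    }
    print(average_ranks(toy))
```

A lower average rank indicates a method that is more consistently near the top, which is why the metric suits comparisons across heterogeneous environments.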
Original abstract

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoMemBench, a unified benchmark for LLM agent memory organized along two axes (memory scope: in-episode vs. cross-episode; memory content: knowledge-oriented vs. execution-oriented). It evaluates 15 representative memory methods against strong long-context baselines under a standardized protocol and concludes that current memory systems remain far from a general solution, with long-context baselines highly competitive, memory most helpful when context is insufficient or tasks difficult, and no single memory form performing consistently across settings.

Significance. If the two-axis protocol and task construction successfully isolate intrinsic memory effects from confounds such as token budget and evolution triggers, the work would be a useful empirical contribution by documenting the conditional utility of memory mechanisms and the persistent strength of long-context approaches. The open code release supports reproducibility and future extensions.

major comments (2)
  1. [§3 and §4] §3 (Benchmark Construction) and §4 (Standardized Protocol): The description does not explicitly demonstrate that every method—including long-context baselines—receives identical total token budgets, identical information density across episodes, and identical self-evolving update triggers. Without these controls, the competitiveness of long-context baselines and the conditional benefit of memory could be artifacts of task scaffolding rather than properties of the memory mechanisms.
  2. [§5] §5 (Experimental Results): The central claim that 'no single memory form works consistently across all settings' rests on the chosen tasks and two-axis splits, yet the section provides no details on statistical tests, error bars, exclusion criteria, or how the in-episode/cross-episode and knowledge/execution axes control for variables such as episode length or retrieval leakage. This makes it difficult to judge whether the observed patterns generalize or are task-specific.
minor comments (2)
  1. [Abstract and §2] Abstract and §2: The term 'self-evolving perspective' is used without a concise operational definition early in the paper; a short paragraph clarifying how evolution is implemented (e.g., update frequency, trigger conditions) would improve readability.
  2. [Tables and Figures] Table captions and figure legends: Some tables comparing the 15 methods lack explicit column headers for context-window size or total tokens used, making direct comparison with long-context baselines harder to verify at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. Below we respond point-by-point to the major concerns, clarifying the controls in our protocol and committing to added statistical details and explicit demonstrations in the revised manuscript.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Standardized Protocol): The description does not explicitly demonstrate that every method—including long-context baselines—receives identical total token budgets, identical information density across episodes, and identical self-evolving update triggers. Without these controls, the competitiveness of long-context baselines and the conditional benefit of memory could be artifacts of task scaffolding rather than properties of the memory mechanisms.

    Authors: We agree that explicit verification of these equivalences strengthens the claims. The standardized protocol in §4 applies the same environment, episode generator, and interaction loop to all methods. Token budgets are capped identically per turn and per episode by the task configuration; long-context baselines receive the full accumulated history up to the same context-window limit used for memory-augmented agents. Self-evolving update triggers are defined by task-level rules (performance thresholds and new-observation detection) that do not depend on the memory implementation. Information density is fixed by the benchmark’s episode templates in §3. To make these controls transparent, we will add a dedicated paragraph and comparison table in §4. revision: yes

  2. Referee: [§5] §5 (Experimental Results): The central claim that 'no single memory form works consistently across all settings' rests on the chosen tasks and two-axis splits, yet the section provides no details on statistical tests, error bars, exclusion criteria, or how the in-episode/cross-episode and knowledge/execution axes control for variables such as episode length or retrieval leakage. This makes it difficult to judge whether the observed patterns generalize or are task-specific.

    Authors: We appreciate the request for greater statistical transparency. All reported numbers are means over five independent runs with different random seeds; standard-deviation error bars appear in the figures but were not described in the text. No runs were excluded. Episode length is standardized within each axis category by the task generator. Retrieval leakage is avoided because memory updates occur only on newly observed information that is absent from the current context window. We will expand §5 with an explicit subsection on these controls, report the run count, and add a short discussion of generalizability limits. revision: yes
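
The reporting scheme described in the simulated rebuttal (a mean over several seeded runs with standard-deviation error bars) can be sketched as follows. Whether the sample or population standard deviation is intended is not stated; this sketch assumes the sample version, and the function name is hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs.

    Assumes one score per random seed. The sample standard deviation
    (n - 1 denominator) is an assumption, not something the text confirms.
    """
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(run_scores), stdev(run_scores)


if __name__ == "__main__":
    # Toy placeholder scores for five seeds.
    avg, sd = summarize_runs([1.0, 2.0, 3.0, 4.0, 5.0])
    print(f"{avg:.2f} ± {sd:.2f}")
```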

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison with no derivations or self-referential reductions

Full rationale

The paper introduces EvoMemBench as an empirical evaluation framework for LLM agent memory, organizing tasks along in-episode vs. cross-episode and knowledge vs. execution axes, then reporting experimental comparisons of 15 existing memory methods against long-context baselines under a standardized protocol. All claims (e.g., long-context competitiveness, conditional benefits of memory, lack of consistent winner) rest directly on observed results from these runs rather than any first-principles derivation, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, uniqueness theorems, or ansatzes appear; the work is self-contained as a benchmark study whose validity can be assessed against external task reproductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a benchmark introduction that rests on standard domain assumptions about agent memory needs rather than new derivations or entities.

axioms (1)
  • domain assumption: Existing benchmarks do not provide a systematic way to assess memory mechanisms in LLM agents.
    This premise is stated directly in the abstract as justification for creating EvoMemBench.

pith-pipeline@v0.9.0 · 5780 in / 1197 out tokens · 53682 ms · 2026-05-20T11:31:48.438213+00:00 · methodology



Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 15 internal anchors

  1. [1]

    MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

    Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems. arXiv preprint arXiv:2510.17281, 2025. URL https://arxiv.org/abs/2510.17281

  2. [2]

    Arigraph: Learning knowledge graph world models with episodic memory for llm agents, 2025

    Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for llm agents, 2025. URL https://arxiv.org/abs/2407.04363

  3. [3]

    Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark

    Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, and Liang He. Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark. arXiv preprint arXiv:2508.19005, 2025. URL https://arxiv.org/abs/2...

  4. [4]

    xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025

    Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025

  5. [5]

    Mem0: Building production-ready ai agents with scalable long-term memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. In ECAI, 2025

  6. [6]

    Cl-bench: A benchmark for context learning

    Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, et al. Cl-bench: A benchmark for context learning. arXiv preprint arXiv:2602.03587, 2026

  7. [7]

    DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.11763

  8. [8]

    Memguide: Intent-driven memory selection for goal-oriented multi-session llm agents

    Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, and Kam-Fai Wong. Memguide: Intent-driven memory selection for goal-oriented multi-session llm agents, 2025. URL https://arxiv.org/abs/2505.20231

  9. [9]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

  10. [10]

    A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems, 2025. URL https://arxiv.org/abs/2508.07407

  11. [11]

    Memp: Exploring Agent Procedural Memory

    Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory, 2026. URL https://arxiv.org/abs/2508.06433

  12. [12]

    Robin: A multi-agent system for automating scientific discovery

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery,

  13. [13]

    URL https://arxiv.org/abs/2505.13400

  14. [14]

    Gemini 3 flash: Frontier intelligence built for speed

    Google DeepMind. Gemini 3 flash: Frontier intelligence built for speed. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025

  15. [15]

    Legomem: Modular procedural memory for multi-agent llm systems for workflow automation, 2025

    Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. Legomem: Modular procedural memory for multi-agent llm systems for workflow automation, 2025. URL https://arxiv.org/abs/2510.04851

  16. [16]

    Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks

    Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. ICML, 2026.

  17. [17]

    Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025

  18. [18]

    Evaluating memory in llm agents via incremental multi-turn interactions

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions. In ICLR, 2026

  19. [19]

    Memory os of ai agent

    Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. In EMNLP, pages 25972–25981, 2025

  20. [20]

    Experience-evolving multi-turn tool-use agent with hybrid episodic-procedural memory, 2026

    Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, and Rui Wang. Experience-evolving multi-turn tool-use agent with hybrid episodic-procedural memory, 2026. URL https://arxiv.org/abs/2512.07287

  21. [21]

    MemOS: A Memory OS for AI System

    Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, et al. Memos: A memory os for ai system. arXiv preprint arXiv:2507.03724, 2025

  22. [22]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  23. [23]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753, 2024

  24. [24]

    Byterover: Agent-native memory through llm-curated hierarchical context, 2026

    Andy Nguyen, Danh Doan, Hoang Pham, Bao Ha, Dat Pham, Linh Nguyen, Hieu Nguyen, Thien Nguyen, Cuong Do, Phat Nguyen, and Toan Nguyen. Byterover: Agent-native memory through llm-curated hierarchical context, 2026. URL https://arxiv.org/abs/2604.01599

  25. [25]

    Introducing gpt-5

    OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025

  26. [26]

    Reasoningbank: Scaling agent self-evolving with reasoning memory

    Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory. In ICLR, 2026

  27. [27]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In ICML, pages 48371–48392, 2025

  28. [28]

    Memobrain: Executive memory as an agentic brain for reasoning

    Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning. In ACL Findings, 2026

  29. [29]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

  30. [30]

    Meminsight: Autonomous memory augmentation for llm agents, 2025

    Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba. Meminsight: Autonomous memory augmentation for llm agents, 2025. URL https://arxiv.org/abs/2503.21760

  31. [31]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

  32. [32]

    Alfworld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. In ICLR, 2021

  33. [33]

    Agent kb: Leveraging cross-domain experience for agentic problem solving

    Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, et al. Agent kb: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025

  34. [34]

    Agent workflow memory

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In ICML, pages 63897–63911, 2025.

  35. [35]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025

  36. [36]

    Longmemeval: Benchmarking chat assistants on long-term interactive memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In ICLR, 2025

  37. [37]

    Webwalker: Benchmarking llms in web traversal

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal. In ACL, pages 10290–10305, 2025

  38. [38]

    A-mem: Agentic memory for llm agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. In NeurIPS, 2025

  39. [39]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023

  40. [40]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259, 2025

  41. [41]

    MemEvolve: Meta-Evolution of Agent Memory Systems

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

  42. [42]

    Agentic context engineering: Evolving contexts for self-improving language models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. In ICLR, 2026

  43. [43]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  44. [44]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, et al. Skillweaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025

  45. [45]

    Memento: Fine-tuning llm agents without fine-tuning llms

    Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning llm agents without fine-tuning llms, 2025. URL https://arxiv.org/abs/2508.16153.

    A Additional Details on EvoMemBench. A.1 Details of Datasets. All datasets used in our study are previously published benc...

  46. [46]

    Its ground-truth contains 2 or more actions; and

  47. [47]

    the directory you created

    those actions form a natural user-level dependency, such as:
    - lookup -> use result
    - authenticate -> perform protected action
    - prepare state -> execute action
    - retrieve current info -> commit operation
    Do NOT split:
    - single-action turns
    - turns whose actions are merely parallel
    - turns whose internal steps are only low-level implementation details wit...

  48. [48]

    Never split a single-action turn

  49. [49]

    Never merge two original turns

  50. [50]

    Never split a turn into more parts than the number of actions in its original ground-truth

  51. [51]

    Every split part must contain at least one action

  52. [52]

    Later split queries may refer to earlier split queries, but must not depend on future turns

  53. [53]

    Keep unsplit turns semantically unchanged except for light editing if needed for flow.

  54. [54]

    After splitting, do not expose cross-turn dependencies more explicitly than in the original task

  55. [55]

    id": "<same id as input>

    Prefer memory-dependent phrasing over answer-revealing phrasing.
    Positive example:
    Original turn:
    Query: Move 'final_report.pdf' within document directory to 'temp' directory in document. Make sure to create the directory
    Ground truth:
    - cd(folder='document')
    - mkdir(dir_name='temp')
    - mv(source='final_report.pdf', destination='temp')
    Good rewrite: Turn 1 Qu...

  56. [56]

    Every original action appears exactly once in rewritten_ground_truth

  57. [57]

    The global action order is preserved

  58. [58]

    Each rewritten query matches its rewritten ground-truth

  59. [59]

    No new actions were introduced

  60. [60]

    No original actions were dropped

  61. [61]

    The rewritten dialogue is coherent

  62. [62]

    Later rewritten turns use implicit references wherever appropriate

  63. [63]

    The Dark Z and Charged Higgs Decay

    Cross-turn dependencies are not unnecessarily exposed by explicit restatement.

    Now process the following input.
    Query JSON: <PASTE_QUERY_JSON_HERE>
    Ground Truth JSON: <PASTE_GROUND_TRUTH_JSON_HERE>

    B.2 Details of Experiments. To ensure a unified implementation, each memory method is wrapped with two interfaces: utilize and update. The utilize interface ta...
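
The excerpt above says each memory method is wrapped with two interfaces, utilize and update, but the text is cut off before the signatures are given. The sketch below is therefore an assumption throughout: the argument types, the `KeywordMemory` toy class, and the `run_episode` loop are hypothetical and only illustrate how a two-method wrapper lets different memory methods share one evaluation loop.

```python
from typing import Protocol


class MemoryMethod(Protocol):
    """Hypothetical shape of the utilize/update wrapper (signatures assumed)."""

    def utilize(self, query: str) -> str: ...

    def update(self, observation: str) -> None: ...


class KeywordMemory:
    """Toy memory: stores raw strings, returns those sharing a word with the query."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def utilize(self, query: str) -> str:
        words = set(query.lower().split())
        hits = [item for item in self._items if words & set(item.lower().split())]
        return "\n".join(hits)

    def update(self, observation: str) -> None:
        self._items.append(observation)


def run_episode(memory: MemoryMethod, turns: list[str]) -> list[str]:
    """Call utilize before each turn and update after it; return retrieved contexts."""
    contexts = []
    for turn in turns:
        contexts.append(memory.utilize(turn))  # read from memory first
        memory.update(turn)  # then write the new turn into memory
    return contexts


if __name__ == "__main__":
    out = run_episode(KeywordMemory(), ["open the drawer", "close the drawer", "find keys"])
    print(out)
```

Because `run_episode` depends only on the two-method protocol, any memory implementation that provides both methods can be swapped in without changing the loop.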