EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

Bing Tong; Chen Zhang; Jia Li; Kaichi Yu; Miao Peng; Mo Chi; Yan Zhou; Yuhan Li; Yuyao Wang; Zhongjian Zhang

arXiv: 2605.18421 · v1 · pith:5P4VYI7Rnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

Pith reviewed 2026-05-20 11:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: LLM agents, agent memory, benchmarking, memory evaluation, long-context baselines, self-evolving agents, knowledge vs execution tasks

The pith

Current memory systems for LLM agents fall short of a general solution, with long-context baselines remaining competitive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to evaluate memory in agents that must handle information across evolving interactions. It organizes memory demands along two axes: scope (within one episode or across episodes) and content (facts or procedures). Testing fifteen memory approaches against long-context baselines reveals that memory adds value mainly when the immediate context is too short or tasks grow difficult. No single memory type succeeds across all cases, though retrieval suits knowledge tasks and procedural storage aids execution when past experience matches the task structure. This matters because agents require dependable recall to manage ongoing work without starting over each time.

Core claim

The paper shows that existing memory mechanisms do not deliver a reliable edge over extended context windows in agent settings. Systematic tests organized by in-episode versus cross-episode scope and knowledge-oriented versus execution-oriented content demonstrate that memory benefits emerge chiefly under limited context or higher task difficulty, retrieval methods lead in knowledge settings, and procedural or long-term methods help execution tasks only when stored experience fits the task structure.

What carries the argument

EvoMemBench, a benchmark structured along memory scope and memory content axes under a standardized self-evolving protocol for comparing agent memory methods.
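
The two-axis organization can be pictured as a small grid. The following is a minimal Python sketch, assuming nothing beyond the axis names given in the paper's abstract; the class and field names (`Scope`, `Content`, `Setting`) are illustrative choices and are not taken from the EvoMemBench code.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Scope(Enum):
    """Memory scope axis (names are illustrative)."""
    IN_EPISODE = "in-episode"
    CROSS_EPISODE = "cross-episode"


class Content(Enum):
    """Memory content axis (names are illustrative)."""
    KNOWLEDGE = "knowledge-oriented"
    EXECUTION = "execution-oriented"


@dataclass(frozen=True)
class Setting:
    scope: Scope
    content: Content

    def label(self) -> str:
        return f"{self.scope.value} / {self.content.value}"


# The four cells of the scope x content grid, in definition order.
ALL_SETTINGS = [Setting(s, c) for s, c in product(Scope, Content)]

if __name__ == "__main__":
    for setting in ALL_SETTINGS:
        print(setting.label())
```

Tagging every result with a `Setting` like this makes it straightforward to report findings per cell rather than as one pooled score.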

If this is right

Memory systems deliver the most help when the current context window cannot hold all required information.
Higher task difficulty increases the practical value of any memory mechanism.
Retrieval-based memory remains effective for tasks centered on retrieving and using knowledge.
Procedural and long-term memory approaches gain traction for execution tasks when stored experience matches the current task demands.
No single memory form produces steady gains across the full range of tested settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Agent builders could begin with strong long-context models and add memory only for specific short-context or complex scenarios.
The two-axis structure points toward hybrid memory designs that select types based on whether a task needs facts or procedures.
Standardized benchmarks of this kind could track incremental progress toward memory that works reliably in varied agent environments.
Extending the evaluation to include more interactive or longer-horizon agent behaviors would test whether self-evolving memory gains further importance.

Load-bearing premise

The chosen tasks, standardized protocol, and two-axis organization capture the memory challenges that matter for real LLM agent deployments.

What would settle it

A follow-up evaluation where one memory method outperforms long-context baselines on every task type, scope, and difficulty level would challenge the finding that no general solution yet exists.
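
The settling condition above amounts to a dominance check over a results table. A minimal sketch in Python follows, assuming scores are stored as a nested mapping from method to setting to score and that higher is better; the function name, the setting keys, and the numbers are all hypothetical and invented for illustration.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]  # method -> setting -> score (higher is better)


def dominates_everywhere(scores: Scores, baseline: str) -> List[str]:
    """Return the methods that strictly beat `baseline` in every setting."""
    base = scores[baseline]
    winners = []
    for method, per_setting in scores.items():
        if method == baseline:
            continue
        # A missing setting raises KeyError, so incomplete rows are not silently accepted.
        if all(per_setting[s] > base[s] for s in base):
            winners.append(method)
    return winners


# Toy numbers, invented purely for illustration; they are not results from the paper.
toy_scores = {
    "long_context": {"in_know": 0.70, "cross_know": 0.60, "cross_exec": 0.50},
    "memory_a": {"in_know": 0.75, "cross_know": 0.55, "cross_exec": 0.65},
    "memory_b": {"in_know": 0.72, "cross_know": 0.61, "cross_exec": 0.51},
}

if __name__ == "__main__":
    print(dominates_everywhere(toy_scores, "long_context"))
```

The paper's finding corresponds to this list being empty on its own results; a non-empty list in a follow-up study would be the counterexample described above.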

Figures

Figures reproduced from arXiv: 2605.18421 by Bing Tong, Chen Zhang, Jia Li, Kaichi Yu, Miao Peng, Mo Chi, Yan Zhou, Yuhan Li, Yuyao Wang, Zhongjian Zhang.

**Figure 1.** Overview of EvoMemBench. Existing memory benchmarks cover only parts of this space. Text-centric benchmarks such as LoCoMo [22], LongMemEval [35], and MemoryAgentBench [17] mainly evaluate knowledge retention, retrieval, or revision in conversational or document-like contexts, but do not test whether memory supports action and tool-based execution. Recent agentic or lifelong memory benchmarks, including Me…

**Figure 2.** Accuracy and cost on INEP-KNOW. Finding 1: Strong long-context baselines remain highly competitive. As shown in Table 3 and …

**Figure 3.** Cross-environment transfer results in CROSSEP-TOOL.

**Figure 4.** Cross-environment transfer results in CROSSEP-EMB. In Embodied AI, procedural long-term memory performs best, with an average rank of 5.60. This result shows that cross-episode execution evolution cannot be addressed by one fixed memory form. Retrieval-augmented memory reuses similar past cases, which fits tool-use tasks where similar API calls and parameter patterns recur. General long-term memory maintai…

**Figure 5.** Cross-environment transfer results in CROSSEP-EMB for BM25, GraphRAG, Mem0, and MemOS.

**Figure 6.** Cross-environment transfer results in CROSSEP-EMB for MemoryOS, AWM, AgentKB, and ACE.

**Figure 7.** Cross-environment transfer results in CROSSEP-TOOL for BM25, GraphRAG, Mem0, and MemOS.

**Figure 8.** Cross-environment transfer results in CROSSEP-TOOL for MemoryOS, AWM, AgentKB, and ACE. C.2 Additional Efficiency Results for Cross-Episode Settings: We report token usage for cross-episode knowledge evolution and both average steps and token usage for cross-episode execution settings …
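
The Figure 4 excerpt quotes an average rank across environments. The page does not say how the paper computes that rank, so the following Python sketch shows one common convention as an assumption: rank 1 is best within each environment, tied scores share the mean of the positions they occupy, and every method appears in every environment. The function name and the toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def average_ranks(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """scores: environment -> method -> score (higher is better).

    Returns each method's rank averaged over environments. Assumes every
    method has a score in every environment.
    """
    totals = defaultdict(float)
    for per_method in scores.values():
        ordered = sorted(per_method.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j over the run of methods tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared = (i + 1 + j + 1) / 2  # mean of 1-based positions i+1..j+1
            for name, _ in ordered[i : j + 1]:
                totals[name] += shared
            i = j + 1
    n_envs = len(scores)
    return {name: total / n_envs for name, total in totals.items()}


# Toy scores, invented for illustration only.
toy_scores = {
    "env_1": {"a": 0.9, "b": 0.5, "c": 0.7},
    "env_2": {"a": 0.8, "b": 0.4, "c": 0.6},
    "env_3": {"a": 0.5, "b": 0.5, "c": 0.1},
}

if __name__ == "__main__":
    for method, rank in sorted(average_ranks(toy_scores).items()):
        print(f"{method}: {rank:.2f}")
```

Average rank is scale-free, which helps when environments report success rates on different ranges, but it discards the size of the gaps between methods.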

Abstract

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EvoMemBench adds a two-axis benchmark for agent memory and shows long-context baselines holding up, but the results may not cleanly separate memory from task construction choices.

EvoMemBench gives a new way to compare memory mechanisms in LLM agents by splitting along scope (in-episode versus cross-episode) and content (knowledge versus execution), with a self-evolving framing. The paper runs fifteen methods plus long-context baselines under one protocol and reports that memory helps most when context is short or tasks are hard, retrieval suits knowledge work, and procedural memory fits execution when it matches the structure. No single approach wins across the board, and long-context stays competitive overall. Releasing the code is a practical step that lets others inspect or extend the setup directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoMemBench, a unified benchmark for LLM agent memory organized along two axes (memory scope: in-episode vs. cross-episode; memory content: knowledge-oriented vs. execution-oriented). It evaluates 15 representative memory methods against strong long-context baselines under a standardized protocol and concludes that current memory systems remain far from a general solution, with long-context baselines highly competitive, memory most helpful when context is insufficient or tasks difficult, and no single memory form performing consistently across settings.

Significance. If the two-axis protocol and task construction successfully isolate intrinsic memory effects from confounds such as token budget and evolution triggers, the work would be a useful empirical contribution by documenting the conditional utility of memory mechanisms and the persistent strength of long-context approaches. The open code release supports reproducibility and future extensions.

major comments (2)

[§3 and §4] §3 (Benchmark Construction) and §4 (Standardized Protocol): The description does not explicitly demonstrate that every method—including long-context baselines—receives identical total token budgets, identical information density across episodes, and identical self-evolving update triggers. Without these controls, the competitiveness of long-context baselines and the conditional benefit of memory could be artifacts of task scaffolding rather than properties of the memory mechanisms.
[§5] §5 (Experimental Results): The central claim that 'no single memory form works consistently across all settings' rests on the chosen tasks and two-axis splits, yet the section provides no details on statistical tests, error bars, exclusion criteria, or how the in-episode/cross-episode and knowledge/execution axes control for variables such as episode length or retrieval leakage. This makes it difficult to judge whether the observed patterns generalize or are task-specific.

minor comments (2)

[Abstract and §2] Abstract and §2: The term 'self-evolving perspective' is used without a concise operational definition early in the paper; a short paragraph clarifying how evolution is implemented (e.g., update frequency, trigger conditions) would improve readability.
[Tables and Figures] Table captions and figure legends: Some tables comparing the 15 methods lack explicit column headers for context-window size or total tokens used, making direct comparison with long-context baselines harder to verify at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. Below we respond point-by-point to the major concerns, clarifying the controls in our protocol and committing to added statistical details and explicit demonstrations in the revised manuscript.

Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Standardized Protocol): The description does not explicitly demonstrate that every method—including long-context baselines—receives identical total token budgets, identical information density across episodes, and identical self-evolving update triggers. Without these controls, the competitiveness of long-context baselines and the conditional benefit of memory could be artifacts of task scaffolding rather than properties of the memory mechanisms.

Authors: We agree that explicit verification of these equivalences strengthens the claims. The standardized protocol in §4 applies the same environment, episode generator, and interaction loop to all methods. Token budgets are capped identically per turn and per episode by the task configuration; long-context baselines receive the full accumulated history up to the same context-window limit used for memory-augmented agents. Self-evolving update triggers are defined by task-level rules (performance thresholds and new-observation detection) that do not depend on the memory implementation. Information density is fixed by the benchmark’s episode templates in §3. To make these controls transparent, we will add a dedicated paragraph and comparison table in §4.
Revision: yes
Referee: [§5] §5 (Experimental Results): The central claim that 'no single memory form works consistently across all settings' rests on the chosen tasks and two-axis splits, yet the section provides no details on statistical tests, error bars, exclusion criteria, or how the in-episode/cross-episode and knowledge/execution axes control for variables such as episode length or retrieval leakage. This makes it difficult to judge whether the observed patterns generalize or are task-specific.

Authors: We appreciate the request for greater statistical transparency. All reported numbers are means over five independent runs with different random seeds; standard-deviation error bars appear in the figures but were not described in the text. No runs were excluded. Episode length is standardized within each axis category by the task generator. Retrieval leakage is avoided because memory updates occur only on newly observed information that is absent from the current context window. We will expand §5 with an explicit subsection on these controls, report the run count, and add a short discussion of generalizability limits.
Revision: yes
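
The simulated response above describes means over five seeded runs with standard-deviation error bars. A minimal Python sketch of that aggregation follows, using the standard-library `statistics` module. Whether the paper uses the sample or the population standard deviation is not stated, so the sample version (`stdev`) is an assumption here, and the accuracies are invented for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(runs: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each method to (mean, sample standard deviation) over its runs."""
    summary = {}
    for method, values in runs.items():
        if len(values) < 2:
            raise ValueError(f"{method}: need at least two runs for a standard deviation")
        summary[method] = (mean(values), stdev(values))
    return summary


# Invented accuracies for five seeds per method; not results from the paper.
toy_runs = {
    "long_context": [0.60, 0.62, 0.61, 0.59, 0.63],
    "retrieval_memory": [0.64, 0.58, 0.66, 0.60, 0.62],
}

if __name__ == "__main__":
    for method, (m, s) in summarize_runs(toy_runs).items():
        print(f"{method}: {m:.3f} +/- {s:.3f}")
```

With only five runs the spread estimate is itself noisy, so overlapping error bars between a memory method and the long-context baseline should be read as inconclusive rather than as a tie.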

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison with no derivations or self-referential reductions

The paper introduces EvoMemBench as an empirical evaluation framework for LLM agent memory, organizing tasks along in-episode vs. cross-episode and knowledge vs. execution axes, then reporting experimental comparisons of 15 existing memory methods against long-context baselines under a standardized protocol. All claims (e.g., long-context competitiveness, conditional benefits of memory, lack of consistent winner) rest directly on observed results from these runs rather than any first-principles derivation, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, uniqueness theorems, or ansatzes appear; the work is self-contained as a benchmark study whose validity can be assessed against external task reproductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a benchmark introduction that rests on standard domain assumptions about agent memory needs rather than new derivations or entities.

axioms (1)

domain assumption: Existing benchmarks do not provide a systematic way to assess memory mechanisms in LLM agents.
This premise is stated directly in the abstract as justification for creating EvoMemBench.

pith-pipeline@v0.9.0 · 5780 in / 1197 out tokens · 53682 ms · 2026-05-20T11:31:48.438213+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented).

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We compare 15 representative memory methods with strong long-context baselines under a standardized protocol.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. MemoryBench: A benchmark for memory and continual learning in LLM systems. arXiv preprint arXiv:2510.17281, 2025. URL https://arxiv.org/abs/2510.17281

[2] Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for LLM agents, 2025. URL https://arxiv.org/abs/2407.04363

[3] Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, and Liang He. Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark. arXiv preprint arXiv:2508.19005, 2025. URL https://arxiv.org/abs/2...

[4] Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025

[5] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. In ECAI, 2025

[6] Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, et al. CL-bench: A benchmark for context learning. arXiv preprint arXiv:2602.03587, 2026

[7] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.11763

[8] Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, and Kam-Fai Wong. Memguide: Intent-driven memory selection for goal-oriented multi-session LLM agents, 2025. URL https://arxiv.org/abs/2505.20231

[9] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

[10] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A comprehensive survey of self-evolving AI agents: A new paradigm bridging foundation models and lifelong agentic systems, 2025. URL https://arxiv.org/abs/2508.07407

[11] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory, 2026. URL https://arxiv.org/abs/2508.06433

[12] Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery. URL https://arxiv.org/abs/2505.13400

[13] Google DeepMind. Gemini 3 Flash: Frontier intelligence built for speed. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025

[14] Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. Legomem: Modular procedural memory for multi-agent LLM systems for workflow automation, 2025. URL https://arxiv.org/abs/2510.04851

[15] Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. ICML, 2026

[16] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025

[17] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. In ICLR, 2026

[18] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. In EMNLP, pages 25972–25981, 2025

[19] Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, and Rui Wang. Experience-evolving multi-turn tool-use agent with hybrid episodic-procedural memory, 2026. URL https://arxiv.org/abs/2512.07287

[20] Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, et al. MemOS: A memory OS for AI system. arXiv preprint arXiv:2507.03724, 2025

[21] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

[22] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024

[23] Andy Nguyen, Danh Doan, Hoang Pham, Bao Ha, Dat Pham, Linh Nguyen, Hieu Nguyen, Thien Nguyen, Cuong Do, Phat Nguyen, and Toan Nguyen. Byterover: Agent-native memory through LLM-curated hierarchical context, 2026. URL https://arxiv.org/abs/2604.01599

[24] OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025

[25] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory. In ICLR, 2026

[26] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In ICML, pages 48371–48392, 2025

[27] Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning. In ACL Findings, 2026

[28] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

[29] Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba. Meminsight: Autonomous memory augmentation for LLM agents, 2025. URL https://arxiv.org/abs/2503.21760

[30] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

[31] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. In ICLR, 2021

[32] Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, et al. Agent KB: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025

[33] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In ICML, pages 63897–63911, 2025

[34] Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025

[35] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In ICLR, 2025

[36] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking LLMs in web traversal. In ACL, pages 10290–10305, 2025

[37] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. In NeurIPS, 2025

[38] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023

[39] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259, 2025

[40] Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

[41] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. In ICLR, 2026

[42] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

[43] Boyuan Zheng, Michael Y Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, et al. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025

[44] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning LLM agents without fine-tuning LLMs, 2025. URL https://arxiv.org/abs/2508.16153

A Additional Details on EvoMemBench
A.1 Details of Datasets
All datasets used in our study are previously published benc...
Its ground-truth contains 2 or more actions; and
those actions form a natural user-level dependency, such as:
- lookup -> use result
- authenticate -> perform protected action
- prepare state -> execute action
- retrieve current info -> commit operation

Do NOT split:
- single-action turns
- turns whose actions are merely parallel
- turns whose internal steps are only low-level implementation details wit...

Never split a single-action turn
Never merge two original turns
Never split a turn into more parts than the number of actions in its original ground-truth
Every split part must contain at least one action
Later split queries may refer to earlier split queries, but must not depend on future turns
Keep unsplit turns semantically unchanged except for light editing if needed for flow.
After splitting, do not expose cross-turn dependencies more explicitly than in the original task
Prefer memory-dependent phrasing over answer-revealing phrasing.

Positive example:
Original turn:
Query: Move 'final_report.pdf' within document directory to 'temp' directory in document. Make sure to create the directory
Ground truth:
- cd(folder='document')
- mkdir(dir_name='temp')
- mv(source='final_report.pdf', destination='temp')
Good rewrite:
Turn 1 Qu...

Every original action appears exactly once in rewritten_ground_truth
The global action order is preserved
Each rewritten query matches its rewritten ground-truth
No new actions were introduced
No original actions were dropped
The rewritten dialogue is coherent
Later rewritten turns use implicit references wherever appropriate
Cross-turn dependencies are not unnecessarily exposed by explicit restatement.

Now process the following input.
Query JSON: <PASTE_QUERY_JSON_HERE>
Ground Truth JSON: <PASTE_GROUND_TRUTH_JSON_HERE>

B.2 Details of Experiments.
To ensure a unified implementation, each memory method is wrapped with two interfaces: utilize and update. The utilize interface ta...

[1] Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. MemoryBench: A benchmark for memory and continual learning in LLM systems. arXiv preprint arXiv:2510.17281, 2025. URL https://arxiv.org/abs/2510.17281.

[2] Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for LLM agents, 2025. URL https://arxiv.org/abs/2407.04363.

[3] Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, and Liang He. Building self-evolving agents via experience-driven lifelong learning: A framework and benchmark. arXiv preprint arXiv:2508.19005, 2025. URL https://arxiv.org/abs/2...

[4] Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025.

[5] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. In ECAI, 2025.

[6] Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, et al. Cl-bench: A benchmark for context learning. arXiv preprint arXiv:2602.03587, 2026.

[7] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents, 2025. URL https://arxiv.org/abs/2506.11763.

[8] Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, and Kam-Fai Wong. Memguide: Intent-driven memory selection for goal-oriented multi-session LLM agents, 2025. URL https://arxiv.org/abs/2505.20231.

[9] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

[10] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A comprehensive survey of self-evolving AI agents: A new paradigm bridging foundation models and lifelong agentic systems, 2025. URL https://arxiv.org/abs/2508.07407.

[11] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory, 2026. URL https://arxiv.org/abs/2508.06433.

[12] Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery. URL https://arxiv.org/abs/2505.13400.

[14] Google DeepMind. Gemini 3 flash: Frontier intelligence built for speed. https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/, 2025.

[15] Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. Legomem: Modular procedural memory for multi-agent LLM systems for workflow automation, 2025. URL https://arxiv.org/abs/2510.04851.

[16] Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. ICML, 2026.

[17] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025.

[18] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. In ICLR, 2026.

[19] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. In EMNLP, pages 25972–25981, 2025.

[20] Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, and Rui Wang. Experience-evolving multi-turn tool-use agent with hybrid episodic-procedural memory, 2026. URL https://arxiv.org/abs/2512.07287.

[21] Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, et al. MemOS: A memory OS for AI system. arXiv preprint arXiv:2507.03724, 2025.

[22] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.

[23] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024.

[24] Andy Nguyen, Danh Doan, Hoang Pham, Bao Ha, Dat Pham, Linh Nguyen, Hieu Nguyen, Thien Nguyen, Cuong Do, Phat Nguyen, and Toan Nguyen. Byterover: Agent-native memory through LLM-curated hierarchical context, 2026. URL https://arxiv.org/abs/2604.01599.

[25] OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025.

[26] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory. In ICLR, 2026.

[27] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In ICML, pages 48371–48392, 2025.

[28] Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning. In ACL Findings, 2026.

[29] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009.

[30] Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba. Meminsight: Autonomous memory augmentation for LLM agents, 2025. URL https://arxiv.org/abs/2503.21760.

[31] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023.

[32] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. In ICLR, 2021.

[33] Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, et al. Agent KB: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025.

[34] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In ICML, pages 63897–63911, 2025.

[35] Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025.

[36] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In ICLR, 2025.

[37] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking LLMs in web traversal. In ACL, pages 10290–10305, 2025.

[38] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. In NeurIPS, 2025.

[39] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In ICLR, 2023.

[40] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259, 2025.

[41] Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025.

[42] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. In ICLR, 2026.

[43] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.

[44] Boyuan Zheng, Michael Y Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, et al. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025.

[45] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning LLM agents without fine-tuning LLMs, 2025. URL https://arxiv.org/abs/2508.16153.

A Additional Details on EvoMemBench

A.1 Details of Datasets

All datasets used in our study are previously published benc...

- Its ground-truth contains 2 or more actions; and
- those actions form a natural user-level dependency, such as:
  - lookup -> use result
  - authenticate -> perform protected action
  - prepare state -> execute action
  - retrieve current info -> commit operation

Do NOT split:
- single-action turns
- turns whose actions are merely parallel
- turns whose internal steps are only low-level implementation details wit...

- Never split a single-action turn.
- Never merge two original turns.
- Never split a turn into more parts than the number of actions in its original ground-truth.
- Every split part must contain at least one action.
- Later split queries may refer to earlier split queries, but must not depend on future turns.
- Keep unsplit turns semantically unchanged except for light editing if needed for flow.
- After splitting, do not expose cross-turn dependencies more explicitly than in the original task.

- Prefer memory-dependent phrasing over answer-revealing phrasing.

Positive example:

Original turn:
Query: Move 'final_report.pdf' within document directory to 'temp' directory in document. Make sure to create the directory
Ground truth:
- cd(folder='document')
- mkdir(dir_name='temp')
- mv(source='final_report.pdf', destination='temp')

Good rewrite:
Turn 1 Qu...

- Every original action appears exactly once in rewritten_ground_truth.
- The global action order is preserved.
- Each rewritten query matches its rewritten ground-truth.
- No new actions were introduced.
- No original actions were dropped.
- The rewritten dialogue is coherent.
- Later rewritten turns use implicit references wherever appropriate.

- Cross-turn dependencies are not unnecessarily exposed by explicit restatement.

Now process the following input.

Query JSON: <PASTE_QUERY_JSON_HERE>

Ground Truth JSON: <PASTE_GROUND_TRUTH_JSON_HERE>

B.2 Details of Experiments. To ensure a unified implementation, each memory method is wrapped with two interfaces: utilize and update. The utilize interface ta...
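The structural items of the rewriting prompt's rules and checklist (no actions added or dropped, global order preserved, every split part non-empty, no more parts than original actions) lend themselves to an automatic check. The following is a minimal sketch of such a check, not the authors' implementation: the function name `check_split`, the representation of actions as plain strings, and the two-turn split used in the example are all assumptions made for illustration. The semantic items (query matches ground-truth, coherence, implicit references) are not covered, since they would need human or model judgment.

```python
from collections import Counter
from typing import List


def check_split(original: List[str], parts: List[List[str]]) -> List[str]:
    """Return the structural checks that a proposed split violates (empty list = pass)."""
    problems = []
    # Concatenate the per-turn ground-truth lists back into one sequence.
    flat = [action for part in parts for action in part]
    before, after = Counter(original), Counter(flat)

    if after - before:
        problems.append("actions added or duplicated")
    if before - after:
        problems.append("original actions dropped")
    # Order is only meaningful once both sides hold the same multiset of actions.
    if before == after and flat != original:
        problems.append("global action order changed")
    if any(len(part) == 0 for part in parts):
        problems.append("a split part has no action")
    if len(parts) > len(original):
        problems.append("more parts than original actions")
    return problems


# Ground-truth actions from the prompt's positive example.
original = [
    "cd(folder='document')",
    "mkdir(dir_name='temp')",
    "mv(source='final_report.pdf', destination='temp')",
]
# Hypothetical two-turn split, chosen for illustration; not the paper's own rewrite.
proposed = [original[:2], original[2:]]
print(check_split(original, proposed))  # prints []
```

An empty list means the split passes the structural checks. Comparing multisets with `Counter` before comparing order keeps the "dropped" and "reordered" failures from being reported for the same underlying mistake.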