pith. machine review for the scientific record.

arxiv: 2605.13880 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL

Recognition: no theorem link

PREPING: Building Agent Memory without Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords agent memory construction · synthetic task generation · pre-task preparation · proposer-validator loop · cold-start problem · procedural memory · memory without tasks

The pith

Agents can construct competitive procedural memory for new environments using only self-generated synthetic tasks before any real experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether agents can close the cold-start gap in new environments by building memory ahead of time through purely synthetic practice. Preping introduces a proposer that maintains a structured control state to generate tasks, a solver that executes them, and a validator that filters useful trajectories while supplying feedback to improve future proposals. This loop keeps synthetic data from becoming redundant or infeasible and prevents memory from degrading. Experiments across AppWorld, BFCL v3, and MCP-Universe show that the resulting memory outperforms a no-memory baseline and matches strong methods that rely on offline demonstrations or online interactions, at substantially lower deployment cost. The gains stem from controlled coverage rather than from the sheer volume of synthetic data produced.

Core claim

Preping shows that procedural memory can be built pre-task by maintaining a proposer memory state that conditions synthetic task generation. A solver runs the generated tasks, and a validator selects only eligible trajectories for memory insertion while returning feedback that refines the next round of proposals. The resulting memory substantially improves over a no-memory baseline and reaches performance competitive with playbook methods built from real experience, while cutting deployment cost relative to online memory construction by factors of 2.99× on AppWorld and 2.23× on BFCL v3.

What carries the argument

Proposer memory, the structured control state that shapes future synthetic task proposals, together with the closed proposer-solver-validator loop that enforces feasibility, reduces redundancy, and supplies targeted updates to memory.
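The loop described here (a proposer conditioned on a control state, a solver, and a validator gating memory insertion while feeding back into the proposer) can be sketched in a few lines. The interfaces, field names, and control-state contents below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ProposerMemory:
    """Structured control state (hypothetical fields): what has been
    practiced so far and what feedback the validator has returned."""
    practiced: set = field(default_factory=set)
    feedback: list = field(default_factory=list)

def build_memory(propose, solve, validate, n_rounds=10):
    """One possible shape of the proposer-solver-validator loop; the
    three callables stand in for the LLM-backed components."""
    m_prop = ProposerMemory()
    solver_memory = []                       # Msol: validated trajectories
    for _ in range(n_rounds):
        task = propose(m_prop)               # conditioned on proposer memory
        trajectory = solve(task)             # execute in the environment
        eligible, feedback = validate(task, trajectory)
        if eligible:                         # only filtered trajectories enter Msol
            solver_memory.append(trajectory)
        m_prop.practiced.add(task)           # record the attempt either way
        m_prop.feedback.append(feedback)     # steer the next proposals
    return solver_memory
```

The point of the sketch is the control flow: every trajectory passes through the validator before touching solver memory, and every outcome updates the proposer's state regardless of the verdict.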

If this is right

  • Memory construction becomes possible with zero direct experience of target tasks.
  • Deployment costs fall by more than twofold relative to online memory construction.
  • Gains arise from proposer-side control over feasibility and coverage rather than data volume.
  • The approach applies across AppWorld, BFCL v3, and MCP-Universe benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If synthetic task generators become more faithful, agents could initialize in entirely new domains with no real-world data collection at all.
  • The method could be tested in robotics or simulation-heavy domains where creating synthetic tasks is cheap but real interactions are expensive.
  • The proposer memory state itself might be initialized from minimal seed examples and then refined purely through the validation loop.

Load-bearing premise

Synthetic tasks generated without any direct exposure to real target-environment tasks will still produce trajectories that transfer usefully when the agent later encounters those real tasks.

What would settle it

Insert Preping-built memory into an agent and test it on real tasks whose key patterns the synthetic proposals systematically omit; if the transfer premise is wrong, performance should fall back to the no-memory baseline.

Figures

Figures reproduced from arXiv: 2605.13880 by Jinheon Baek, Minki Kang, Sangwoo Park, Sung Ju Hwang, Yumin Choi.

Figure 1. PREPING builds memory before the first user task and mitigates online cold-start. Left: unlike offline methods that require prior human tasks or online methods that start from empty memory, PREPING constructs procedural memory through self-generated synthetic practice before deployment. Right: PREPING establishes broad tool coverage before deployment, whereas ACE-Online must accumulate coverage from user-f…

Figure 2. Overview of PREPING. PREPING builds procedural memory before deployment through a synthetic-practice loop: proposing tasks, executing them in the environment, validating the resulting trajectories, and updating memory. Proposer memory (Mprop) shapes what to practice next, while the Validator filters which synthetic trajectories are reliable enough to become solver memory (Msol).

Figure 3. PREPING mitigates online cold-start before memory builds up. Despite using no human-defined or deployment-time user tasks for memory construction, PREPING remains competitive with task-informed methods: on AppWorld it exceeds ACE-Offline and is close to ACE-Online, while on BFCL v3 it surpasses ACE-Online on average.

Figure 4. Construction budget: AppWorld Test-Normal TGC as a function of the number of synthetic tasks.

Figure 5. Deployment-time cost per task. Online memory construction can improve performance, but it adds memory-update calls during user-facing deployment; PREPING constructs solver memory before deployment, allowing user tasks to be executed without additional memory-construction calls.

Figure 6. BFCL task-generation meta prompt used in our implementation of P…

Figure 7. BFCL validator meta prompt used in our implementation of P…

Figure 8. BFCL grounded environment-information summarization prompt used to extract proposer…

Figure 9. AppWorld ACE reflector prompt used for trajectory analysis before playbook update.

Figure 10. AppWorld ACE curator prompt used to merge reusable memory into the solver memory.

Figure 11. AppWorld task-solving generation prompt.

Figure 12. BFCL task-solving generation prompt (a ReAct-style template combining the instruction, tool descriptions, solver memory, and the question to answer).

Figure 13. MCP-Universe task-solving generation prompt.

Figure 14. BFCL exploration baseline prompt templates.

Figure 15. AppWorld exploration baseline prompt templates; both prompts include the shared AppWorld execution rules shown at the bottom.

Figure 16. Iteration dynamics of AppWorld component ablations: cumulative invalid tasks, cumulative unique tools, and cumulative tool entropy (bits) over iterations, for Naive, Val. only, Val. + Hist., Val. + Env., and PREPING.

Figure 17. Iteration dynamics of BFCL component ablations. Figures 16 and 17 show how the component-ablation variants evolve over the ten synthetic-task construction iterations; curves are averaged over three independent runs, with shaded bands denoting standard deviation. Naive is omitted from the invalid-task panels because it has no validator labels.
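Figure 16 tracks cumulative tool entropy as a coverage diagnostic. For reference, a minimal sketch of that quantity, assuming tool calls are logged as a flat list of tool names (an illustrative simplification of the paper's logging):

```python
import math
from collections import Counter

def tool_entropy(tool_calls):
    """Shannon entropy (bits) of the empirical tool-usage distribution.
    Higher values mean practice is spread over more tools more evenly;
    zero means every call hit the same tool."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Four tools used once each give 2.0 bits; repeated calls to a single tool give 0.0, so a rising curve over iterations indicates broadening coverage.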
Original abstract

Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost $2.99\times$ lower on AppWorld and $2.23\times$ lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Preping enables pre-task construction of agent memory via a proposer-guided loop: a structured 'proposer memory' state conditions synthetic task generation, a solver executes the generated tasks, and a validator filters trajectories for memory insertion while supplying feedback. On AppWorld, BFCL v3, and MCP-Universe, the resulting memory yields substantial gains over a no-memory baseline, matches strong offline/online playbook methods, and reduces deployment cost by 2.99× and 2.23× respectively versus online memory construction; the benefit is attributed to control of feasibility, redundancy, and coverage rather than synthetic volume alone.

Significance. If the synthetic-to-real transfer holds under distributional controls, the result would meaningfully reduce cold-start costs for agent deployment in new environments and weaken reliance on expensive post-deployment interaction data, with direct implications for scalable agentic systems.

major comments (2)
  1. [Experiments] Experiments section: the headline performance claims rest on comparisons whose statistical robustness is not reported (no p-values, confidence intervals, or details on trajectory filtering rules and ablation controls for volume vs. proposer control). This makes it impossible to confirm that gains arise from the claimed mechanisms rather than uncontrolled factors.
  2. [Method] Method and Experiments: the central transfer assumption—that proposer-generated synthetic tasks (shaped by proposer memory, validator filtering, and feedback) sufficiently overlap the real-task distribution on AppWorld/BFCL v3—is load-bearing yet unsupported by any coverage metric (tool-call histograms, state-transition statistics, or embedding distances). Without such checks, observed improvements could stem from generic scaffolding rather than targeted pre-task preparation.
minor comments (2)
  1. [Abstract] Abstract and §3: the term 'proposer memory' is introduced as a 'structured control state' but its precise representation (e.g., data structures, update rules) is not formalized early enough for readers to follow the loop without backtracking.
  2. [Related Work] Related work: the positioning against prior synthetic-data and self-play methods for agents would benefit from explicit citations to recent work on procedural task generation and memory consolidation.
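The statistical gap flagged in major comment 1 is addressable with standard machinery. A minimal sketch of a percentile-bootstrap confidence interval over per-task 0/1 outcomes; the data shapes, success rates, and procedure here are hypothetical illustrations, not the paper's analysis:

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the difference in mean success rate
    between two runs of per-task binary outcomes (illustrative only)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]          # resample with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot)]
    return lo, hi

# Hypothetical per-task outcomes: method vs. no-memory baseline.
method   = [1] * 70 + [0] * 30   # 70% success over 100 tasks
baseline = [1] * 45 + [0] * 55   # 45% success over 100 tasks
lo, hi = bootstrap_diff_ci(method, baseline)
```

A CI whose lower bound stays above zero would support the headline comparison; reporting intervals of this kind is what the referee asks for.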

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of statistical rigor and distributional overlap. We address each major point below and have revised the manuscript to incorporate additional analyses and reporting.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline performance claims rest on comparisons whose statistical robustness is not reported (no p-values, confidence intervals, or details on trajectory filtering rules and ablation controls for volume vs. proposer control). This makes it impossible to confirm that gains arise from the claimed mechanisms rather than uncontrolled factors.

    Authors: We agree that the original submission omitted p-values, confidence intervals, and explicit ablation controls separating volume from proposer guidance. In the revised manuscript we now report 95% confidence intervals and p-values for all headline comparisons on AppWorld, BFCL v3, and MCP-Universe. We have also expanded the Experiments section with (i) the precise validator filtering rules and (ii) new volume-controlled ablations that hold proposer memory fixed while varying the number of synthetic trajectories. These additions show that performance scales with proposer-guided selection rather than raw volume. revision: yes

  2. Referee: [Method] Method and Experiments: the central transfer assumption—that proposer-generated synthetic tasks (shaped by proposer memory, validator filtering, and feedback) sufficiently overlap the real-task distribution on AppWorld/BFCL v3—is load-bearing yet unsupported by any coverage metric (tool-call histograms, state-transition statistics, or embedding distances). Without such checks, observed improvements could stem from generic scaffolding rather than targeted pre-task preparation.

    Authors: The referee correctly notes the absence of explicit coverage metrics in the original version. While end-to-end gains on held-out real tasks provide indirect support for transfer, we have added direct distributional checks in the revision: tool-call histograms, state-transition statistics, and embedding-distance comparisons between the synthetic trajectories and the real task distributions. These metrics indicate substantial overlap in tool usage and state coverage, consistent with the claim that gains arise from targeted, proposer-controlled practice rather than generic scaffolding. revision: yes
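The tool-call-histogram coverage check discussed in this exchange can be made concrete with a histogram-intersection score; the trajectory representation and tool names below are hypothetical stand-ins for real agent logs:

```python
from collections import Counter

def tool_histogram(trajectories):
    """Normalized tool-call frequencies across a set of trajectories,
    where each trajectory is a list of tool names."""
    counts = Counter(tool for traj in trajectories for tool in traj)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def histogram_overlap(p, q):
    """Histogram intersection in [0, 1]: 1 means identical tool-usage
    distributions, 0 means disjoint tool sets."""
    tools = set(p) | set(q)
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in tools)

# Hypothetical synthetic vs. real trajectories.
synthetic = [["search", "open", "reply"], ["search", "open"]]
real      = [["search", "open", "reply", "archive"]]
coverage = histogram_overlap(tool_histogram(synthetic), tool_histogram(real))
```

An overlap near 1 would indicate the synthetic practice distribution tracks real tool usage; tools present only in the real column (here, "archive") directly expose coverage gaps.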

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

Full rationale

The paper presents Preping as a proposer-validator framework for synthetic pre-task memory construction and supports its claims solely through experimental results on AppWorld, BFCL v3, and MCP-Universe. No mathematical derivations, equations, or first-principles predictions are advanced that could reduce by construction to fitted inputs, self-definitions, or self-citation chains. Performance comparisons (e.g., against no-memory baselines and playbook methods) rest on direct, externally measurable benchmark outcomes rather than any internal reduction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework introduces new components (proposer memory, proposer, validator) that are postulated to solve redundancy and feasibility issues; no free parameters or external axioms are stated in the abstract.

invented entities (2)
  • proposer memory no independent evidence
    purpose: structured control state that shapes future synthetic task proposals
    Core new construct introduced to control practice quality; no independent evidence outside the framework itself.
  • Validator no independent evidence
    purpose: determines eligible trajectories for memory insertion and supplies feedback to the proposer
    New filtering and feedback component; independent evidence not provided beyond reported benchmark gains.

pith-pipeline@v0.9.0 · 5573 in / 1167 out tokens · 39746 ms · 2026-05-15T06:10:55.020382+00:00 · methodology

