pith. machine review for the scientific record.

arxiv: 2605.13880 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL

Recognition: no theorem link

PREPING: Building Agent Memory without Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords agent memory construction · synthetic task generation · pre-task preparation · proposer-validator loop · cold-start problem · procedural memory · memory without tasks

The pith

Agents can construct competitive procedural memory for new environments using only self-generated synthetic tasks before any real experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether agents can close the cold-start gap in new environments by building memory ahead of time through purely synthetic practice. Preping introduces a proposer that maintains a structured control state to generate tasks, a solver that executes them, and a validator that filters useful trajectories while supplying feedback to improve future proposals. This loop keeps synthetic data from becoming redundant or infeasible and prevents memory from degrading. Experiments across AppWorld, BFCL v3, and MCP-Universe show that the resulting memory outperforms a no-memory baseline and matches strong methods that rely on offline demonstrations or online interactions, at substantially lower deployment cost. The gains stem from controlled coverage rather than from the sheer volume of synthetic data produced.

Core claim

Preping shows that procedural memory can be built pre-task by maintaining a proposer memory state that conditions synthetic task generation. A solver runs the generated tasks, and a validator selects only eligible trajectories for memory insertion while returning feedback that refines the next round of proposals. The resulting memory substantially improves over a no-memory baseline and reaches performance competitive with playbook methods built from real experience, while cutting deployment cost relative to online memory construction by factors of 2.99× on AppWorld and 2.23× on BFCL v3.

What carries the argument

Proposer memory, the structured control state that shapes future synthetic task proposals, together with the closed proposer-solver-validator loop that enforces feasibility, reduces redundancy, and supplies targeted updates to memory.
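The loop described here (a proposer conditioned on a control state, a solver, and a validator gating memory insertion while feeding back into the proposer) can be sketched in a few lines. The interfaces, field names, and control-state contents below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ProposerMemory:
    """Structured control state (hypothetical fields): what has been
    practiced so far and what feedback the validator has returned."""
    practiced: set = field(default_factory=set)
    feedback: list = field(default_factory=list)

def build_memory(propose, solve, validate, n_rounds=10):
    """One possible shape of the proposer-solver-validator loop; the
    three callables stand in for the LLM-backed components."""
    m_prop = ProposerMemory()
    solver_memory = []                       # Msol: validated trajectories
    for _ in range(n_rounds):
        task = propose(m_prop)               # conditioned on proposer memory
        trajectory = solve(task)             # execute in the environment
        eligible, feedback = validate(task, trajectory)
        if eligible:                         # only filtered trajectories enter Msol
            solver_memory.append(trajectory)
        m_prop.practiced.add(task)           # record the attempt either way
        m_prop.feedback.append(feedback)     # steer the next proposals
    return solver_memory
```

The point of the sketch is the control flow: every trajectory passes through the validator before touching solver memory, and every outcome updates the proposer's state regardless of the verdict.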

If this is right

  • Memory construction becomes possible with zero direct experience of target tasks.
  • Deployment costs fall by more than twofold relative to online memory construction.
  • Gains arise from proposer-side control over feasibility and coverage rather than data volume.
  • The approach applies across AppWorld, BFCL v3, and MCP-Universe benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If synthetic task generators become more faithful, agents could initialize in entirely new domains with no real-world data collection at all.
  • The method could be tested in robotics or simulation-heavy domains where creating synthetic tasks is cheap but real interactions are expensive.
  • The proposer memory state itself might be initialized from minimal seed examples and then refined purely through the validation loop.

Load-bearing premise

Synthetic tasks generated without any direct exposure to real target-environment tasks will still produce trajectories that transfer usefully when the agent later encounters those real tasks.

What would settle it

Insert Preping-built memory into an agent and test it on real tasks whose key patterns the synthetic proposals systematically omit; if the transfer premise is wrong, performance should fall back to the no-memory baseline.

Figures

Figures reproduced from arXiv: 2605.13880 by Jinheon Baek, Minki Kang, Sangwoo Park, Sung Ju Hwang, Yumin Choi.

Figure 1. PREPING builds memory before the first user task and mitigates online cold-start. Left: unlike offline methods that require prior human tasks or online methods that start from empty memory, PREPING constructs procedural memory through self-generated synthetic practice before deployment. Right: PREPING establishes broad tool coverage before deployment, whereas ACE-Online must accumulate coverage from user-f…

Figure 2. Overview of PREPING. PREPING builds procedural memory before deployment through a synthetic-practice loop: proposing tasks, executing them in the environment, validating the resulting trajectories, and updating memory. Proposer memory (Mprop) shapes what to practice next, while the Validator filters which synthetic trajectories are reliable enough to become solver memory (Msol).

Figure 3. PREPING mitigates online cold-start before memory builds up. Despite using no human-defined or deployment-time user tasks for memory construction, PREPING remains competitive with task-informed methods: on AppWorld it exceeds ACE-Offline and is close to ACE-Online, while on BFCL v3 it surpasses ACE-Online on average.

Figure 4. Construction budget: AppWorld Test-Normal TGC as a function of the number of synthetic tasks.

Figure 5. Deployment-time cost per task. Online memory construction can improve performance, but it adds memory-update calls during user-facing deployment; PREPING constructs solver memory before deployment, allowing user tasks to be executed without additional memory-construction calls.

Figure 6. BFCL task-generation meta prompt used in our implementation of P…

Figure 7. BFCL validator meta prompt used in our implementation of P…

Figure 8. BFCL grounded environment-information summarization prompt used to extract proposer…

Figure 9. AppWorld ACE reflector prompt used for trajectory analysis before playbook update.

Figure 10. AppWorld ACE curator prompt used to merge reusable memory into the solver memory.

Figure 11. AppWorld task-solving generation prompt.

Figure 12. BFCL task-solving generation prompt (a ReAct-style template combining the instruction, tool descriptions, solver memory, and the question to answer).

Figure 13. MCP-Universe task-solving generation prompt.

Figure 14. BFCL exploration baseline prompt templates.

Figure 15. AppWorld exploration baseline prompt templates; both prompts include the shared AppWorld execution rules shown at the bottom.

Figure 16. Iteration dynamics of AppWorld component ablations: cumulative invalid tasks, cumulative unique tools, and cumulative tool entropy (bits) over iterations, for Naive, Val. only, Val. + Hist., Val. + Env., and PREPING.

Figure 17. Iteration dynamics of BFCL component ablations. Figures 16 and 17 show how the component-ablation variants evolve over the ten synthetic-task construction iterations; curves are averaged over three independent runs, with shaded bands denoting standard deviation. Naive is omitted from the invalid-task panels because it has no validator labels.
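Figure 16 tracks cumulative tool entropy as a coverage diagnostic. For reference, a minimal sketch of that quantity, assuming tool calls are logged as a flat list of tool names (an illustrative simplification of the paper's logging):

```python
import math
from collections import Counter

def tool_entropy(tool_calls):
    """Shannon entropy (bits) of the empirical tool-usage distribution.
    Higher values mean practice is spread over more tools more evenly;
    zero means every call hit the same tool."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Four tools used once each give 2.0 bits; repeated calls to a single tool give 0.0, so a rising curve over iterations indicates broadening coverage.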
Original abstract

Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost $2.99\times$ lower on AppWorld and $2.23\times$ lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Preping enables pre-task construction of agent memory via a proposer-guided loop: a structured 'proposer memory' state conditions synthetic task generation, a solver executes the generated tasks, and a validator filters trajectories for memory insertion while supplying feedback. On AppWorld, BFCL v3, and MCP-Universe, the resulting memory yields substantial gains over a no-memory baseline, matches strong offline/online playbook methods, and reduces deployment cost by 2.99× and 2.23× respectively versus online memory construction; the benefit is attributed to control of feasibility, redundancy, and coverage rather than synthetic volume alone.

Significance. If the synthetic-to-real transfer holds under distributional controls, the result would meaningfully reduce cold-start costs for agent deployment in new environments and weaken reliance on expensive post-deployment interaction data, with direct implications for scalable agentic systems.

major comments (2)
  1. [Experiments] Experiments section: the headline performance claims rest on comparisons whose statistical robustness is not reported (no p-values, confidence intervals, or details on trajectory filtering rules and ablation controls for volume vs. proposer control). This makes it impossible to confirm that gains arise from the claimed mechanisms rather than uncontrolled factors.
  2. [Method] Method and Experiments: the central transfer assumption—that proposer-generated synthetic tasks (shaped by proposer memory, validator filtering, and feedback) sufficiently overlap the real-task distribution on AppWorld/BFCL v3—is load-bearing yet unsupported by any coverage metric (tool-call histograms, state-transition statistics, or embedding distances). Without such checks, observed improvements could stem from generic scaffolding rather than targeted pre-task preparation.
minor comments (2)
  1. [Abstract] Abstract and §3: the term 'proposer memory' is introduced as a 'structured control state' but its precise representation (e.g., data structures, update rules) is not formalized early enough for readers to follow the loop without backtracking.
  2. [Related Work] Related work: the positioning against prior synthetic-data and self-play methods for agents would benefit from explicit citations to recent work on procedural task generation and memory consolidation.
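The statistical gap flagged in major comment 1 is addressable with standard machinery. A minimal sketch of a percentile-bootstrap confidence interval over per-task 0/1 outcomes; the data shapes, success rates, and procedure here are hypothetical illustrations, not the paper's analysis:

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the difference in mean success rate
    between two runs of per-task binary outcomes (illustrative only)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]          # resample with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot)]
    return lo, hi

# Hypothetical per-task outcomes: method vs. no-memory baseline.
method   = [1] * 70 + [0] * 30   # 70% success over 100 tasks
baseline = [1] * 45 + [0] * 55   # 45% success over 100 tasks
lo, hi = bootstrap_diff_ci(method, baseline)
```

A CI whose lower bound stays above zero would support the headline comparison; reporting intervals of this kind is what the referee asks for.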

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of statistical rigor and distributional overlap. We address each major point below and have revised the manuscript to incorporate additional analyses and reporting.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline performance claims rest on comparisons whose statistical robustness is not reported (no p-values, confidence intervals, or details on trajectory filtering rules and ablation controls for volume vs. proposer control). This makes it impossible to confirm that gains arise from the claimed mechanisms rather than uncontrolled factors.

    Authors: We agree that the original submission omitted p-values, confidence intervals, and explicit ablation controls separating volume from proposer guidance. In the revised manuscript we now report 95% confidence intervals and p-values for all headline comparisons on AppWorld, BFCL v3, and MCP-Universe. We have also expanded the Experiments section with (i) the precise validator filtering rules and (ii) new volume-controlled ablations that hold proposer memory fixed while varying the number of synthetic trajectories. These additions show that performance scales with proposer-guided selection rather than raw volume. revision: yes

  2. Referee: [Method] Method and Experiments: the central transfer assumption—that proposer-generated synthetic tasks (shaped by proposer memory, validator filtering, and feedback) sufficiently overlap the real-task distribution on AppWorld/BFCL v3—is load-bearing yet unsupported by any coverage metric (tool-call histograms, state-transition statistics, or embedding distances). Without such checks, observed improvements could stem from generic scaffolding rather than targeted pre-task preparation.

    Authors: The referee correctly notes the absence of explicit coverage metrics in the original version. While end-to-end gains on held-out real tasks provide indirect support for transfer, we have added direct distributional checks in the revision: tool-call histograms, state-transition statistics, and embedding-distance comparisons between the synthetic trajectories and the real task distributions. These metrics indicate substantial overlap in tool usage and state coverage, consistent with the claim that gains arise from targeted, proposer-controlled practice rather than generic scaffolding. revision: yes
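The tool-call-histogram coverage check discussed in this exchange can be made concrete with a histogram-intersection score; the trajectory representation and tool names below are hypothetical stand-ins for real agent logs:

```python
from collections import Counter

def tool_histogram(trajectories):
    """Normalized tool-call frequencies across a set of trajectories,
    where each trajectory is a list of tool names."""
    counts = Counter(tool for traj in trajectories for tool in traj)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def histogram_overlap(p, q):
    """Histogram intersection in [0, 1]: 1 means identical tool-usage
    distributions, 0 means disjoint tool sets."""
    tools = set(p) | set(q)
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in tools)

# Hypothetical synthetic vs. real trajectories.
synthetic = [["search", "open", "reply"], ["search", "open"]]
real      = [["search", "open", "reply", "archive"]]
coverage = histogram_overlap(tool_histogram(synthetic), tool_histogram(real))
```

An overlap near 1 would indicate the synthetic practice distribution tracks real tool usage; tools present only in the real column (here, "archive") directly expose coverage gaps.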

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

Full rationale

The paper presents Preping as a proposer-validator framework for synthetic pre-task memory construction and supports its claims solely through experimental results on AppWorld, BFCL v3, and MCP-Universe. No mathematical derivations, equations, or first-principles predictions are advanced that could reduce by construction to fitted inputs, self-definitions, or self-citation chains. Performance comparisons (e.g., against no-memory baselines and playbook methods) rest on direct, externally measurable benchmark outcomes rather than any internal reduction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework introduces new components (proposer memory, proposer, validator) that are postulated to solve redundancy and feasibility issues; no free parameters or external axioms are stated in the abstract.

invented entities (2)
  • proposer memory no independent evidence
    purpose: structured control state that shapes future synthetic task proposals
    Core new construct introduced to control practice quality; no independent evidence outside the framework itself.
  • Validator no independent evidence
    purpose: determines eligible trajectories for memory insertion and supplies feedback to the proposer
    New filtering and feedback component; independent evidence not provided beyond reported benchmark gains.

pith-pipeline@v0.9.0 · 5573 in / 1167 out tokens · 39746 ms · 2026-05-15T06:10:55.020382+00:00 · methodology

