PREPING: Building Agent Memory without Tasks
Pith reviewed 2026-05-15 06:10 UTC · model grok-4.3
The pith
Agents can construct competitive procedural memory for new environments using only self-generated synthetic tasks before any real experience.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Preping shows that procedural memory can be built pre-task by maintaining a proposer memory state that conditions synthetic task generation. A solver runs the generated tasks and a validator selects only eligible trajectories for memory insertion while returning feedback that refines the next round of proposals. The resulting memory substantially improves over a no-memory baseline and reaches performance levels competitive with playbook methods built from real experience, while cutting deployment cost by factors of 2.99 on AppWorld and 2.23 on BFCL v3.
What carries the argument
Proposer memory, the structured control state that shapes future synthetic task proposals, together with the closed proposer-solver-validator loop that enforces feasibility, reduces redundancy, and supplies targeted updates to memory.
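The closed loop described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the dictionary-shaped proposer memory, the toy solver, and the success-plus-novelty eligibility rule are illustrative assumptions, not the paper's implementation.

```python
import random

def propose(proposer_memory, n=4):
    """Generate synthetic tasks conditioned on the proposer memory state.

    Coverage pressure (assumed heuristic): prefer tools not yet covered.
    """
    covered = proposer_memory["covered_tools"]
    pool = [t for t in proposer_memory["tool_pool"] if t not in covered] \
        or proposer_memory["tool_pool"]
    return [{"goal": f"exercise {random.choice(pool)}"} for _ in range(n)]

def solve(task):
    """Stand-in solver: returns a trajectory with a success flag."""
    return {"task": task, "steps": [task["goal"]],
            "success": random.random() > 0.3}

def validate(trajectory, memory):
    """Eligibility rule (assumed): successful and not redundant with memory."""
    redundant = any(m["steps"] == trajectory["steps"] for m in memory)
    return trajectory["success"] and not redundant

proposer_memory = {"tool_pool": ["search", "calendar", "email"],
                   "covered_tools": set()}
memory = []

for _ in range(3):  # rounds of synthetic practice, before any real task
    for task in propose(proposer_memory):
        traj = solve(task)
        if validate(traj, memory):
            memory.append(traj)  # selective insertion into procedural memory
            tool = task["goal"].split()[-1]
            proposer_memory["covered_tools"].add(tool)  # feedback to proposer
```

The point of the sketch is the information flow, not the components: only validated trajectories reach memory, and validation outcomes feed back into the state that conditions the next round of proposals.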
If this is right
- Memory construction becomes possible with zero direct experience of target tasks.
- Deployment costs fall by a factor of more than two relative to online memory construction.
- Gains arise from proposer-side control over feasibility and coverage rather than data volume.
- The approach applies across AppWorld, BFCL v3, and MCP-Universe benchmarks.
Where Pith is reading between the lines
- If synthetic task generators become more faithful, agents could initialize in entirely new domains with no real-world data collection at all.
- The method could be tested in robotics or simulation-heavy domains where creating synthetic tasks is cheap but real interactions are expensive.
- The proposer memory state itself might be initialized from minimal seed examples and then refined purely through the validation loop.
Load-bearing premise
Synthetic tasks generated without any direct exposure to real target-environment tasks will still produce trajectories that transfer usefully when the agent later encounters those real tasks.
What would settle it
Insert Preping memory into an agent and test it on real tasks whose key patterns the synthetic proposals systematically omit; if the transfer claim is wrong, performance should fall back to the no-memory baseline.
Original abstract
Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost $2.99\times$ lower on AppWorld and $2.23\times$ lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Preping enables pre-task construction of agent memory via a proposer-guided loop: a structured 'proposer memory' state conditions synthetic task generation, a solver executes them, and a validator filters trajectories for memory insertion while supplying feedback. On AppWorld, BFCL v3, and MCP-Universe, the resulting memory yields substantial gains over a no-memory baseline, matches strong offline/online playbook methods, and reduces deployment cost by 2.99× and 2.23× respectively versus online memory construction; the benefit is attributed to control of feasibility, redundancy, and coverage rather than synthetic volume alone.
Significance. If the synthetic-to-real transfer holds under distributional controls, the result would meaningfully reduce cold-start costs for agent deployment in new environments and weaken reliance on expensive post-deployment interaction data, with direct implications for scalable agentic systems.
major comments (2)
- [Experiments] Experiments section: the headline performance claims rest on comparisons whose statistical robustness is not reported (no p-values, confidence intervals, or details on trajectory filtering rules and ablation controls for volume vs. proposer control). This makes it impossible to confirm that gains arise from the claimed mechanisms rather than uncontrolled factors.
- [Method] Method and Experiments: the central transfer assumption—that proposer-generated synthetic tasks (shaped by proposer memory, validator filtering, and feedback) sufficiently overlap the real-task distribution on AppWorld/BFCL v3—is load-bearing yet unsupported by any coverage metric (tool-call histograms, state-transition statistics, or embedding distances). Without such checks, observed improvements could stem from generic scaffolding rather than targeted pre-task preparation.
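One concrete way to supply the missing statistical robustness is a paired bootstrap over per-task outcomes. The sketch below uses fabricated success/failure vectors, not results from the paper, and standard percentile intervals; it illustrates the check the comment asks for, not the authors' analysis.

```python
import random

def bootstrap_diff_ci(wins_a, wins_b, iters=10_000, alpha=0.05, seed=0):
    """95% CI for mean(a) - mean(b) via paired resampling of tasks."""
    rng = random.Random(seed)
    n = len(wins_a)
    diffs = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample task indices
        diffs.append(sum(wins_a[i] - wins_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * iters)]
    hi = diffs[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# Placeholder outcomes (1 = task solved, 0 = failed), paired by task.
preping = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
no_mem  = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
lo, hi = bootstrap_diff_ci(preping, no_mem)
# If the interval excludes 0, the gain is unlikely to be resampling noise.
```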
minor comments (2)
- [Abstract] Abstract and §3: the term 'proposer memory' is introduced as a 'structured control state' but its precise representation (e.g., data structures, update rules) is not formalized early enough for readers to follow the loop without backtracking.
- [Related Work] Related work: the positioning against prior synthetic-data and self-play methods for agents would benefit from explicit citations to recent work on procedural task generation and memory consolidation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of statistical rigor and distributional overlap. We address each major point below and have revised the manuscript to incorporate additional analyses and reporting.
Point-by-point responses
- Referee: [Experiments] Experiments section: the headline performance claims rest on comparisons whose statistical robustness is not reported (no p-values, confidence intervals, or details on trajectory filtering rules and ablation controls for volume vs. proposer control). This makes it impossible to confirm that gains arise from the claimed mechanisms rather than uncontrolled factors.
  Authors: We agree that the original submission omitted p-values, confidence intervals, and explicit ablation controls separating volume from proposer guidance. In the revised manuscript we now report 95% confidence intervals and p-values for all headline comparisons on AppWorld, BFCL v3, and MCP-Universe. We have also expanded the Experiments section with (i) the precise validator filtering rules and (ii) new volume-controlled ablations that hold proposer memory fixed while varying the number of synthetic trajectories. These additions show that performance scales with proposer-guided selection rather than raw volume. Revision: yes.
- Referee: [Method] Method and Experiments: the central transfer assumption—that proposer-generated synthetic tasks (shaped by proposer memory, validator filtering, and feedback) sufficiently overlap the real-task distribution on AppWorld/BFCL v3—is load-bearing yet unsupported by any coverage metric (tool-call histograms, state-transition statistics, or embedding distances). Without such checks, observed improvements could stem from generic scaffolding rather than targeted pre-task preparation.
  Authors: The referee correctly notes the absence of explicit coverage metrics in the original version. While end-to-end gains on held-out real tasks provide indirect support for transfer, we have added direct distributional checks in the revision: tool-call histograms, state-transition statistics, and embedding-distance comparisons between the synthetic trajectories and the real task distributions. These metrics indicate substantial overlap in tool usage and state coverage, consistent with the claim that gains arise from targeted, proposer-controlled practice rather than generic scaffolding. Revision: yes.
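A minimal version of one such coverage check is histogram intersection over tool-call frequencies. The sketch assumes trajectories are represented as sequences of tool names (an illustrative encoding with invented example data, not the paper's actual metric).

```python
from collections import Counter

def tool_histogram(trajectories):
    """Normalized frequency of each tool across a set of trajectories."""
    counts = Counter(tool for traj in trajectories for tool in traj)
    total = sum(counts.values())
    return {tool: c / total for tool, c in counts.items()}

def histogram_overlap(p, q):
    """Histogram intersection in [0, 1]; 1.0 means identical tool usage."""
    tools = set(p) | set(q)
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in tools)

# Invented toy trajectories standing in for synthetic vs. real task logs.
synthetic = [["search", "email"], ["calendar", "search"], ["email"]]
real      = [["search", "email"], ["search", "search", "email"]]
overlap = histogram_overlap(tool_histogram(synthetic), tool_histogram(real))
# overlap == 0.8 for these toy trajectories: high, but the missing
# "calendar" mass in the real set shows where coverage diverges.
```

A low overlap on a real benchmark would be exactly the evidence the referee asks for that gains come from generic scaffolding rather than targeted practice.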
Circularity Check
No circularity: empirical framework evaluated on external benchmarks
full rationale
The paper presents Preping as a proposer-validator framework for synthetic pre-task memory construction and supports its claims solely through experimental results on AppWorld, BFCL v3, and MCP-Universe. No mathematical derivations, equations, or first-principles predictions are advanced that could reduce by construction to fitted inputs, self-definitions, or self-citation chains. Performance comparisons (e.g., against no-memory baselines and playbook methods) rest on direct, externally measurable benchmark outcomes rather than any internal reduction, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
invented entities (2)
- Proposer memory: no independent evidence
- Validator: no independent evidence
Reference graph
Works this paper leans on
- [1] Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, and Gokhan Tur. Tool-R0: Self-evolving LLM agents for tool-learning from zero data. arXiv, 2026. doi:10.48550/arXiv.2602.21320. https://arxiv.org/abs/2602.21320
- [2] Anthropic. Introducing the model context protocol, November 2024. https://www.anthropic.com/news/model-context-protocol. Accessed 2026-05-01.
- [3] DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025. https://huggingface.co/deepseek-ai/DeepSeek-V3.2. Model card.
- [4] Dongge Han, Camille Couturier, Daniel Madrigal Díaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. Legomem: Modular procedural memory for multi-agent LLM systems for workflow automation. arXiv, 2025. doi:10.48550/arXiv.2510.04851. https://arxiv.org/abs/2510.04851
- [5] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-Zero: Self-evolving reasoning LLM from zero data. arXiv, 2025. doi:10.48550/arXiv.2508.05004. https://arxiv.org/abs/2508.05004
- [6] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In International Conference on Learning Representations (ICLR), 2023. https://openreview.net/forum?id=_nGgzQjzaRy
- [7] Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning. arXiv, 2025. doi:10.48550/arXiv.2510.24684. https://arxiv.org/abs/2510.24684
- [8] Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, and Junnan Li. MCP-Universe: Benchmarking large language models with real-world model context protocol servers. arXiv, 2025. doi:10.48550/arXiv.2508.14704. https://arxiv.org/abs/2508.14704
- [9] Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. CLIN: A continually learning language agent for rapid task adaptation and generalization. arXiv, 2023. doi:10.48550/arXiv.2310.10134. https://arxiv.org/abs/2310.10134
- [10] Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models. arXiv, 2025. doi:10.48550/arXiv.2507.13334. https://arxiv.org/abs/2507.13334
- [11] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv, 2026. doi:10.48550/arXiv.2601.11868. https://arxiv.org/abs/2601.11868
- [12] Hadi Nekoei, Aman Jaiswal, Patrice Béchard, Oleh Shliazhko, Orlando Marquez Ayala, Mathieu Reymond, Massimo Caccia, Alexandre Drouin, Sarath Chandar, and Alexandre Lacoste. JEF-hinter: Leveraging offline knowledge for improving web agents adaptation. arXiv, 2025. doi:10.48550/arXiv.2510.04373. https://arxiv.org/abs/2510.04373
- [13] OpenAI. GPT-5.1 model, 2025. https://platform.openai.com/docs/models/gpt-5.1/. OpenAI API documentation; accessed 2026-05-02.
- [14] OpenAI. gpt-oss-120b and gpt-oss-20b model card, August 2025. https://openai.com/index/gpt-oss-model-card/. Accessed 2026-05-02.
- [15] Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. ReasoningBank: Scaling agent self-evolving with reasoning memory. In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=jL7fwchScm
- [17] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv, 2023. doi:10.48550/arXiv.2310.08560. https://arxiv.org/abs/2310.08560
- [18] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In Neural Information Processing Systems (NeurIPS). https://openreview.net/forum?id=tBRNC6YemY
- [20] Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In International Conference on Machine Learning (ICML), 2025.
- [21] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations (ICLR), 2024.
- [22] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Neural Information Processing Systems (NeurIPS), 2023. https://openreview.net/forum?id=Yacmpz84TH
- [23] Mirac Suzgun, Mert Yüksekgönul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. arXiv, 2025. doi:10.48550/arXiv.2504.07952. https://arxiv.org/abs/2504.07952
- [24] Qwen Team. Qwen3 technical report. arXiv, 2025. doi:10.48550/arXiv.2505.09388. https://arxiv.org/abs/2505.09388
- [25] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 16022–16076, 2024. doi:10.18653/v1/…
- [26] Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, and Jingbo Shang. OfficeBench: Benchmarking language agents across multiple applications for office automation. arXiv, 2024. doi:10.48550/arXiv.2407.19056. https://arxiv.org/abs/2407.19056
- [27] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In International Conference on Machine Learning (ICML), 2025. https://openreview.net/forum?id=NTAhi2JEEE
- [28] Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, and Sida I. Wang. Toward training superintelligent software agents through self-play SWE-RL. arXiv, 2025. doi:10.48550/arXiv.2512.18552. https://arxiv.org/abs/2512.18552
- [29] Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, and Huaxiu Yao. Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning. arXiv, 2025. doi:10.48550/arXiv.2511.16043. https://arxiv.org/abs/2511.16043
- [30] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. https://openreview.net/forum?id=WE_vluYUL-X
- [31] Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. Zero: Self-evolving search agents without training data. arXiv, 2026. doi:10.48550/arXiv.2601.07055. https://arxiv.org/abs/2601.07055
- [32] Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, Zhaoyang Liu, Bolin Ding, and Jingren Zhou. AgentEvolver: Towards efficient self-evolving agent system. arXiv, 2025. doi:10.48550/arXiv.2511.10395. https://arxiv.org/abs/2511.10395
- [33] Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-Memory: Tracing hierarchical memory for multi-agent systems. arXiv, 2025. doi:10.48550/arXiv.2506.07398. https://arxiv.org/abs/2506.07398
- [34] Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv. doi:10.48550/arXiv.2512.18746. https://arxiv.org/abs/2512.18746
- [36] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. In International Conference on Learning Representations (ICLR), 2026. https://openrevi…
- [37] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19632–19642, 2024. doi:10.1609/aaai.v38i17.29936
- [38] Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In International Conference on Learning Representations (ICLR), 2024. https://openreview.net/forum?id=Pc8AU1aF5e
- [39] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19724–19731, 2024. doi:10.1609/aaai.v38i17.29946
- [40] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning LLM agents without fine-tuning LLMs. arXiv, 2025. doi:10.48550/arXiv.2508.16153. https://arxiv.org/abs/2508.16153
- [41] Yifei Zhou, Sergey Levine, Jason E. Weston, Xian Li, and Sainbayar Sukhbaatar. Self-challenging language model agents. In Neural Information Processing Systems (NeurIPS), 2025. https://openreview.net/forum?id=9yusqX9DpR