HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Tingyang Chen; Shuo Lu; Kang Zhao; Weicheng Meng; Hanlin Teng; Tianhao Li; Chao Li; Xule Liu; Jian Liang; Zhizhong Zhang; Yuan Xie; Heng Qu; Kun Shao; Jian Luan

arxiv: 2606.14249 · v2 · pith:PBTVU63Knew · submitted 2026-06-12 · 💻 cs.AI

Pith reviewed 2026-07-03 23:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI agents · agent harness · composable primitives · trace-driven evolution · runtime adaptation · benchmark performance · execution feedback

The pith

Composing and evolving agent harnesses from execution traces improves performance by 14.5 percent on average across five benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI agent performance depends on the runtime harness (prompts, tools, memory, and control flow), which is typically hand-crafted and static. HarnessX supplies a foundry that assembles harnesses from typed primitives using a substitution algebra, then adapts them via AEGIS, an engine that converts execution traces into harness updates and model training signals. If this holds, agent progress can come from systematic interface improvement as well as from model scaling. The approach is tested on ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified, where reported gains average 14.5 percent and reach 44 percent, with the largest gains where baselines are lowest.

Core claim

HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal, producing an average gain of 14.5 percent (up to 44.0 percent) across the five benchmarks.

What carries the argument

The substitution algebra that composes typed harness primitives together with the AEGIS trace-driven multi-agent evolution engine that adapts them from execution feedback.
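The abstract does not define the substitution algebra, so the following is only a minimal sketch of what typed composition by substitution could look like. Every name here (`Primitive`, `Harness`, `substitute`, `apply_all`, and the four slot names) is a hypothetical illustration, not the paper's API. The sketch assumes one primitive per slot and treats a substitution as a type-checked rebinding that returns a new harness.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Tuple


@dataclass(frozen=True)
class Primitive:
    """A harness component tagged with the slot type it may occupy (hypothetical)."""
    slot_type: str
    name: str


@dataclass(frozen=True)
class Harness:
    """An assignment of one primitive to each named slot (hypothetical)."""
    slots: Mapping[str, Primitive]

    def substitute(self, slot: str, new: Primitive) -> "Harness":
        """Return a new harness with `slot` rebound; reject type mismatches."""
        if slot not in self.slots:
            raise KeyError(f"unknown slot: {slot}")
        if new.slot_type != slot:
            raise TypeError(f"{new.name} has type {new.slot_type}, expected {slot}")
        # Build a fresh mapping so the original harness is left unchanged.
        return Harness({**self.slots, slot: new})


def apply_all(harness: Harness, substitutions: Iterable[Tuple[str, Primitive]]) -> Harness:
    """Compose substitutions by applying them left to right."""
    for slot, primitive in substitutions:
        harness = harness.substitute(slot, primitive)
    return harness


# Slot names follow the abstract's list: prompts, tools, memory, control flow.
base = Harness({
    "prompt": Primitive("prompt", "react_v1"),
    "tools": Primitive("tools", "shell_only"),
    "memory": Primitive("memory", "none"),
    "control": Primitive("control", "single_loop"),
})

evolved = apply_all(base, [
    ("memory", Primitive("memory", "episodic_store")),
    ("prompt", Primitive("prompt", "plan_then_act")),
])
```

Returning a new harness from each substitution keeps earlier variants available for comparison, which is the property an evolution loop would need when it scores candidates against their parent.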

If this is right

Execution traces become a direct source of both harness updates and model training data.
Agent systems can improve on tasks where current baselines are weakest without requiring larger models.
Runtime interfaces shift from static hand-crafted scaffolding to systematically evolvable components.
Progress on agent benchmarks can be measured and achieved separately from model capability increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same harness-evolution loop could be applied to domains outside the five tested benchmarks to check generality.
If harness evolution works, it suggests a division of labor where model scaling handles core reasoning and harness adaptation handles task-specific mediation.
Open-sourcing the codebase would allow direct tests of whether the substitution algebra alone accounts for part of the gains.

Load-bearing premise

The reported performance gains arise from the composable harness assembly and AEGIS evolution rather than from unstated choices of models, benchmark tuning, or baseline implementations.

What would settle it

Reproduce the five benchmarks using identical base models and baseline implementations while disabling the substitution algebra and AEGIS engine; the gains should disappear if the claim is correct.
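The control described above can be laid out as a two-factor ablation grid per benchmark. The sketch below enumerates the runs and computes each configuration's gain over the fully disabled baseline. It assumes the two components can be toggled independently, which the abstract does not confirm; `Run`, `ablation_runs`, and `attribution` are hypothetical names, and no scores from the paper are used.

```python
from itertools import product
from typing import Dict, List, NamedTuple

# Benchmark names as listed in the abstract.
BENCHMARKS = ["ALFWorld", "GAIA", "WebShop", "tau^3-Bench", "SWE-bench Verified"]


class Run(NamedTuple):
    """One evaluation run with the base model and baseline held fixed."""
    benchmark: str
    use_algebra: bool  # composable assembly on/off (assumed toggle)
    use_aegis: bool    # trace-driven evolution on/off (assumed toggle)


def ablation_runs() -> List[Run]:
    """Enumerate every benchmark under all four on/off combinations."""
    return [Run(b, a, e) for b, a, e in product(BENCHMARKS, (False, True), (False, True))]


def attribution(scores: Dict[Run, float], benchmark: str) -> Dict[str, float]:
    """Gain of each configuration over the run with both components disabled."""
    base = scores[Run(benchmark, False, False)]
    return {
        "algebra_only": scores[Run(benchmark, True, False)] - base,
        "aegis_only": scores[Run(benchmark, False, True)] - base,
        "full": scores[Run(benchmark, True, True)] - base,
    }


runs = ablation_runs()
print(len(runs))  # 5 benchmarks x 4 configurations = 20
```

If the headline gain is carried by the two mechanisms, the "full" delta should account for it while the disabled run matches the published baseline; a large gap between the disabled run and the published baseline would instead point to unstated implementation differences.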

Original abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HarnessX claims +14.5% gains from evolving harnesses with substitution algebra and AEGIS, but the abstract supplies no ablations or baseline details to support the attribution.

The main takeaway is that this paper describes a system called HarnessX for building and evolving agent runtime harnesses automatically from execution traces. It uses a substitution algebra to compose typed primitives and AEGIS, a multi-agent evolution engine, to adapt them, then feeds trajectories back into both harness updates and model training. The reported result is an average +14.5% lift (up to +44%) across ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified, with bigger gains on weaker baselines.

The new pieces are the substitution algebra for harness composition and the trace-driven AEGIS loop that treats harness adaptation as an operational mirror to reinforcement learning. Framing the harness as the changeable interface rather than a fixed hand-crafted layer is a reasonable engineering angle, and the idea of distilling execution data back into systematic harness changes is worth exploring if it scales.

The soft spot is the complete absence of experimental controls in the abstract. No model versions, no description of baseline prompt or tool implementations, no ablation that removes the evolution engine or the composable assembly, and no statistical details. The pattern of larger gains on weak baselines could reflect the method working or could reflect uneven baseline strength. The code is not released yet, so none of this can be checked independently.

This is for people working on agent runtime design who want ideas beyond model scaling. A reader could extract the high-level architecture and the claim, but the lack of detail limits how much weight to put on the numbers right now.

It deserves peer review so the full paper can be checked for the missing controls and ablations. If those are present and solid, the work is worth engaging with; if not, the central claim stays unverified.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. Harness primitives are assembled via a substitution algebra; AEGIS provides a trace-driven multi-agent evolution engine that mirrors symbolic adaptation with reinforcement learning; execution trajectories are used both to update harnesses and to supply model training signals. Across ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified the system is reported to deliver an average +14.5% gain (maximum +44.0%), with larger improvements where baselines are weakest.

Significance. If the performance deltas can be shown to arise specifically from the substitution algebra and AEGIS evolution engine, the work would be significant: it supplies a concrete, complementary lever—runtime harness composition and trace-driven evolution—for agent improvement that does not rely solely on model scaling.

major comments (1)

[Evaluation] Evaluation section: the manuscript reports aggregate gains of +14.5% (up to +44.0%) but supplies no model versions, baseline prompt/tool/memory implementations, hyperparameter search protocol, statistical tests, or ablations that remove either the composable assembly or the AEGIS loop. Because the central claim attributes these deltas specifically to the proposed mechanisms, and because gains are largest where baselines are weakest, the absence of these controls is load-bearing for the empirical conclusion.

minor comments (1)

The statement that the complete codebase will be open-sourced only in a future release should be accompanied by a reproducibility checklist or at minimum a detailed description of baseline re-implementations sufficient for independent verification.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation. We agree that additional controls and details are necessary to substantiate the attribution of gains to the substitution algebra and AEGIS engine, and we will revise the manuscript to address this.

Referee: [Evaluation] Evaluation section: the manuscript reports aggregate gains of +14.5% (up to +44.0%) but supplies no model versions, baseline prompt/tool/memory implementations, hyperparameter search protocol, statistical tests, or ablations that remove either the composable assembly or the AEGIS loop. Because the central claim attributes these deltas specifically to the proposed mechanisms, and because gains are largest where baselines are weakest, the absence of these controls is load-bearing for the empirical conclusion.

Authors: We acknowledge the validity of this concern. The current manuscript does not include the requested details on model versions, baseline implementations, hyperparameter protocols, statistical tests, or component ablations. To address this, the revised manuscript will provide explicit specifications for all models and baselines used, describe the hyperparameter search process, report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals), and include ablations that systematically remove the substitution algebra and the AEGIS loop to isolate their contributions. These additions will strengthen the evidence that the observed gains arise specifically from the proposed mechanisms.

revision: yes
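One of the tests named in the response, a bootstrap confidence interval on paired per-task outcomes, can be sketched with the standard library alone. This is a percentile bootstrap over per-task differences, offered as an illustration of the kind of analysis promised rather than the authors' protocol. The function name, its defaults, and the 0/1 outcome vectors are made up for the example and are not results from the paper.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for treated - baseline.

    Both sequences must score the same tasks in the same order, so that
    resampling task indices preserves the pairing.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample tasks with replacement and record the mean difference each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


# Synthetic per-task success flags, invented purely to exercise the function.
baseline_success = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
harness_success = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
point, low, high = paired_bootstrap_ci(baseline_success, harness_success)
print(f"gain={point:.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

An interval that excludes zero on each benchmark would support the claim that a gain is not sampling noise, although it says nothing about which component produced it; that question needs the ablations.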

Circularity Check

0 steps flagged

No derivation chain or equations; empirical results not circular by construction

The paper reports empirical benchmark gains from HarnessX without presenting any mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-referential definitions. Claims rest on observed performance deltas across ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified rather than any algebraic or definitional reduction. Absence of equations or load-bearing self-citations in a formal chain means no steps qualify under the enumerated circularity patterns; concerns about ablations or baseline details pertain to experimental rigor, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; review is limited to summary information only.

pith-pipeline@v0.9.1-grok · 5787 in / 1106 out tokens · 27711 ms · 2026-07-03T23:41:50.966098+00:00 · methodology


[1] [1]

Langchain.https://github.com/langchain-ai/langchain, 2022

2022

[2] [2]

Claude code.https://github.com/anthropics/claude-code, 2025

Anthropic. Claude code.https://github.com/anthropics/claude-code, 2025

2025

[3] [3]

Introducing dynamic workflows in claude code

Anthropic. Introducing dynamic workflows in claude code. https://claude.com/blog/ introducing-dynamic-workflows-in-claude-code, 2026

2026

[4] [4]

Cursor.https://www.cursor.com, 2023

Anysphere. Cursor.https://www.cursor.com, 2023

2023

[5] [5]

Deerflow.https://github.com/bytedance/deer-flow, 2025

ByteDance. Deerflow.https://github.com/bytedance/deer-flow, 2025

2025

[6] [6]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

2026

[7] [7]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In International Conference on Machine Learning, pages 13481–13544. PMLR, 2024

[9] GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. URL https://arxiv.org/abs/2602.15763

[10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[11] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In International Conference on Learning Representations, 2024

[12] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In International Conference on Learning Representations, 2025

[13] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, 2024

[14] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

[15] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

[16] Pawel Ladosz, Lilian Weng, Minwoo Kim, and Hyondong Oh. Exploration in deep reinforcement learning: A survey. Information Fusion, 85:1–22, 2022

[17] LangChain AI. Langgraph. https://github.com/langchain-ai/langgraph, 2024

[18] Robert Tjarko Lange, Yujin Tang, and Yingtao Tian. The Darwin Gödel Machine: Open-ended evolution of self-improving agents. arXiv preprint arXiv:2505.22535, 2025

[19] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses. arXiv preprint arXiv:2603.28052, 2026

[20] Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, et al. Agent harness engineering: A survey. arXiv preprint, 2026

[21] Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, and Fengli Xu. Agentswift: Efficient llm agent design via value-guided hierarchical search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31843–31851, 2026

[22] Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, and Tao Gui. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses. arXiv preprint arXiv:2604.25850, 2026

[23] Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index

[24] Shuo Lu, Kecheng Yu, Siru Jiang, Yinuo Xu, Bing Zhan, Yanbo Wang, Changxin Ke, Yuan Xu, Xin Xiong, Xinyun Zhou, et al. Openclaw research: A systematic survey of large language model agents in open deployment. 2026

[25] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, 2024

[26] João Moura. Crewai: Framework for orchestrating role-playing autonomous ai agents, 2025

[27] Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, 2024

[28] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: towards llms as operating systems. 2023

[29] Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 7957–7968, 2023

[30] Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, and Yuan Xie. Memory intelligence agent. arXiv preprint arXiv:2604.04503, 2026

[31] Maxime Robeyns, Martin Szummer, and Laurence Aitchison. A self-improving coding agent. arXiv preprint arXiv:2504.15228, 2025

[32] Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. `smolagents`: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025

[33] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[34] Minjie Shen, Yanshu Li, Lulu Chen, Zhichao Fan, Yanhang Li, and Qikai Yang. From mind to machine: The rise of manus ai as a fully autonomous digital agent. arXiv preprint arXiv:2505.02024, 2025

[35] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020

[36] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[37] Yingxu Wang, Siwei Liu, Jinyuan Fang, and Zaiqiao Meng. Evoagentx: An automated framework for evolving agentic workflows. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 643–655, 2025

[38] Jiayi Weng. Learning beyond gradients. https://trinkle23897.github.io/learning-beyond-gradients/, May

[39] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024

[40] Zhe Wu, Donglin Mo, Hongjin Lu, Junliang Xing, Jianheng Liu, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, and Yuanchun Shi. K^2-agent: Co-evolving know-what and know-how for hierarchical mobile device control. arXiv preprint arXiv:2603.00676, 2026

[41] Tianshi Xu, Huifeng Wen, and Meng Li. Adapting the interface, not the model: Runtime harness adaptation for deterministic llm agents. arXiv preprint arXiv:2605.22166, 2026

[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[43] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In International Conference on Learning Representations, 2024

[44] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

[45] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

[46] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

[47] Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents. arXiv preprint arXiv:2603.19461, 2026

[48] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. In International Conference on Learning Representations, 2025

[49] Mingming Zhao, Xiaokang Wei, Yuanqi Shao, Kaiwen Zhou, Lin Yang, Siwei Rao, Junhui Zhan, and Zhitang Chen. A2flow: Automating agentic workflow generation via self-adaptive abstraction operators. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29930–29938, 2026

[50] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, et al. Memento: Fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153, 2025

[51] Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Li Erran Li. Proposer-agent-evaluator (pae): Autonomous skill discovery for foundation model internet agents. In Forty-second International Conference on Machine Learning, 2025

[52] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023

[53] Zhilun Zhou, Zihan Liu, Jiahe Liu, Qingyu Shao, Yihan Wang, Kun Shao, Depeng Jin, and Fengli Xu. Resmas: Resilience optimization in llm-based multi-agent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

[54] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024

Contributions and Acknowledgments

Core Contributors
• Tingyang Chen*
• Shuo Lu*
• Kang Zhao*
• Weicheng Meng
• Kun Shao †
• Jian Luan †

Contr...

[55] Write the code to your scratch dir

[56] Verify by actually running it -- not by reasoning about it. Two levels: - Level 1 -- unit call works: instantiate the processor/tool, drive the async hook, assert the expected state mutation happened. - Level 2 -- round-trip reaches the model: a unit call that returns does not prove the agent sees the return. Simulate the path from your code to the model'...

[57] Iterate if verification fails -- fix the bug, or pivot if the environment does not support what you assumed. Do NOT hide the failure in a try/except

[58] Attach the verifying output as `capability_evidence`. "I believe this will work" is not acceptable; paste the actual command and its output. A candidate whose new code has not been observed to work will burn a round's ship slot for zero flips. Pure prompt-bucket candidates (no code asset) are exempt -- the counterfactual gate provides the equivalent smoke ...

[59] Decompose the goal into ordered sub-goals (locate -> acquire -> transform -> deliver) and complete each before moving on

[60] Systematic exploration: search each surface and container at most once before revisiting. Open closed containers before judging them empty -- the admissible list surfaces `open <recep>` when you arrive at a closed one

[61] Grab immediately: when a required object appears, take it on the very next step before moving elsewhere

[62] Transform before placing: perform any clean/heat/cool state change at the appropriate appliance before heading to the final destination

[63] Direct delivery: once holding the goal object, navigate straight to the target receptacle and place it

[64] Track progress: keep an internal count of objects still to find and place. Only stop searching when the count reaches zero

[65] Avoid loops: never repeat the same action more than twice in a row. If stuck, move to a different unexplored location

[66] Trust the admissible list: if `take X from Y` does not appear, you are not at Y, Y is closed, or X is not visible -- `go to`, `open`, or move on rather than guessing. ## Common Mistakes to Avoid - Revisiting searched locations without new evidence. - Ignoring visible objects -- if the target appears, take it immediately. - Skipping the state change -- do not ...
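
Two of the prompt rules quoted above ("Avoid loops" and "Trust the admissible list") are stated in prose only. As a rough illustration, they could also be enforced outside the model by a small filter over candidate actions. The sketch below is not taken from the paper: the `ActionGuard` class, its method names, and the `max_repeats` parameter are hypothetical, and it assumes the environment supplies the admissible actions as plain strings.

```python
from collections import deque
from typing import Iterable, Optional, Sequence


class ActionGuard:
    """Hypothetical helper (not part of HarnessX) enforcing two prompt rules."""

    def __init__(self, max_repeats: int = 2) -> None:
        # "never repeat the same action more than twice in a row"
        self.max_repeats = max_repeats
        # Remember only the most recent `max_repeats` accepted actions.
        self.recent = deque(maxlen=max_repeats)

    def is_allowed(self, action: str, admissible: Iterable[str]) -> bool:
        # Rule 1: trust the admissible list; never guess an unlisted action.
        if action not in set(admissible):
            return False
        # Rule 2: reject the action if it already fills the whole recent window.
        window_full = len(self.recent) == self.max_repeats
        return not (window_full and all(a == action for a in self.recent))

    def choose(self, ranked: Sequence[str], admissible: Iterable[str]) -> Optional[str]:
        """Return the first allowed action from a ranked candidate list, or None."""
        admissible = list(admissible)
        for action in ranked:
            if self.is_allowed(action, admissible):
                self.recent.append(action)
                return action
        return None
```

A caller would pass the model's candidate actions in preference order. A return value of `None` signals that no candidate survives both rules, which corresponds to the "move to a different unexplored location" fallback in the prompt.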