
arxiv: 2606.14249 · v2 · pith:PBTVU63Knew · submitted 2026-06-12 · 💻 cs.AI

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Pith reviewed 2026-07-03 23:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agents · agent harness · composable primitives · trace-driven evolution · runtime adaptation · benchmark performance · execution feedback

The pith

Composing and evolving agent harnesses from execution traces improves performance by 14.5 percent on average across five benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI agent performance depends on the runtime harness (the prompts, tools, memory, and control flow around the model) and that these harnesses are typically hand-crafted and static. HarnessX supplies a foundry that assembles harnesses from typed primitives using a substitution algebra, then adapts them via AEGIS, an engine that converts execution traces into harness updates and model training signals. If this holds, agent progress becomes possible through systematic interface improvement rather than model scaling alone. The approach is tested on ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified, where reported gains average 14.5 percent and reach 44.0 percent, with the largest gains where baselines are lowest.

Core claim

HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal, producing an average gain of 14.5 percent (up to 44.0 percent) across the five benchmarks.
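
One way to picture the trace-to-update loop in this claim is a plain accept-if-better search. The sketch below is not AEGIS, whose internals are not described on this page. It is a minimal greedy loop under stated assumptions: `evolve`, `propose`, `evaluate`, and the toy environment are hypothetical names, and the harness is reduced to a dictionary.

```python
from typing import Callable, Dict, List, Tuple

Harness = Dict[str, int]   # hypothetical stand-in for a real harness config
Traces = List[str]         # hypothetical stand-in for execution traces


def evolve(
    harness: Harness,
    propose: Callable[[Harness, Traces], Harness],
    evaluate: Callable[[Harness], Tuple[float, Traces]],
    rounds: int,
) -> Tuple[Harness, float]:
    """Greedy loop: keep a candidate only if its score beats the current best."""
    best = dict(harness)
    best_score, traces = evaluate(best)
    for _ in range(rounds):
        candidate = propose(best, traces)
        score, candidate_traces = evaluate(candidate)
        if score > best_score:
            best, best_score, traces = candidate, score, candidate_traces
    return best, best_score


# Toy environment (an assumption for illustration): retries help up to 3.
def toy_evaluate(h: Harness) -> Tuple[float, Traces]:
    retries = h["max_retries"]
    return float(min(retries, 3)), (["timeout"] if retries < 3 else [])


def toy_propose(h: Harness, traces: Traces) -> Harness:
    # Read the traces: a timeout suggests allowing one more retry.
    bump = 1 if "timeout" in traces else 0
    return {**h, "max_retries": h["max_retries"] + bump}


best, score = evolve({"max_retries": 0}, toy_propose, toy_evaluate, rounds=5)
```

In this toy run the loop stops changing the harness once the traces no longer show a timeout. A real system would need a held-out evaluation set so that accepted updates are not simply fitted to the tasks that produced the traces.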

What carries the argument

The substitution algebra that composes typed harness primitives together with the AEGIS trace-driven multi-agent evolution engine that adapts them from execution feedback.
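
The page does not spell out the algebra itself, so the following is only a sketch of what typed substitution could look like. It assumes one primitive per slot and a swap that is allowed only within the same slot. The slot names and the classes `Primitive` and `Harness` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical slot names, borrowed from the harness parts listed in the abstract.
SLOTS = ("prompt", "tools", "memory", "control")


@dataclass(frozen=True)
class Primitive:
    slot: str      # which harness slot this primitive fills
    name: str      # human-readable identifier
    payload: str   # stand-in for the real prompt text, tool spec, etc.

    def __post_init__(self) -> None:
        if self.slot not in SLOTS:
            raise ValueError(f"unknown slot: {self.slot}")


@dataclass(frozen=True)
class Harness:
    parts: Tuple[Primitive, ...]

    def __post_init__(self) -> None:
        # Type check: every slot must be filled exactly once.
        if sorted(p.slot for p in self.parts) != sorted(SLOTS):
            raise ValueError("harness must fill each slot exactly once")

    def get(self, slot: str) -> Primitive:
        return next(p for p in self.parts if p.slot == slot)

    def substitute(self, new: Primitive) -> "Harness":
        """Return a new harness with the same-slot primitive replaced by `new`."""
        return Harness(tuple(new if p.slot == new.slot else p for p in self.parts))


base = Harness((
    Primitive("prompt", "react-v1", "Think, then act."),
    Primitive("tools", "shell", "run(cmd)"),
    Primitive("memory", "none", ""),
    Primitive("control", "single-loop", "max_steps=20"),
))
variant = base.substitute(Primitive("memory", "scratchpad", "notes: []"))
```

Because both classes are frozen, `substitute` leaves `base` untouched and returns a new value, which keeps every candidate harness comparable against its parent.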

If this is right

  • Execution traces become a direct source of both harness updates and model training data.
  • Agent systems can improve on tasks where current baselines are weakest without requiring larger models.
  • Runtime interfaces shift from static hand-crafted scaffolding to systematically evolvable components.
  • Progress on agent benchmarks can be measured and achieved separately from model capability increases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same harness-evolution loop could be applied to domains outside the five tested benchmarks to check generality.
  • If harness evolution works, it suggests a division of labor where model scaling handles core reasoning and harness adaptation handles task-specific mediation.
  • Open-sourcing the codebase would allow direct tests of whether the substitution algebra alone accounts for part of the gains.

Load-bearing premise

The reported performance gains arise from the composable harness assembly and AEGIS evolution rather than from unstated choices of models, benchmark tuning, or baseline implementations.

What would settle it

Reproduce the five benchmarks using identical base models and baseline implementations while disabling the substitution algebra and AEGIS engine; the gains should disappear if the claim is correct.
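
The test above can be laid out as a small grid. The sketch below goes one step further than the text by toggling the two components independently (a 2x2 design per benchmark), which is an editorial assumption rather than something the paper reports. The function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

BENCHMARKS: List[str] = [
    "ALFWorld", "GAIA", "WebShop", "tau^3-Bench", "SWE-bench Verified",
]


def ablation_grid(benchmarks: List[str] = BENCHMARKS) -> Iterator[Dict[str, object]]:
    """Yield one run config per (benchmark, algebra on/off, AEGIS on/off)."""
    for bench, algebra, aegis in product(benchmarks, (False, True), (False, True)):
        yield {"benchmark": bench, "use_algebra": algebra, "use_aegis": aegis}


def attribute_gain(scores: Dict[Tuple[bool, bool], float]) -> Dict[str, float]:
    """Split the full gain for one benchmark into per-component parts.

    `scores[(use_algebra, use_aegis)]` is the score for that condition.
    """
    base = scores[(False, False)]
    algebra_only = scores[(True, False)] - base
    aegis_only = scores[(False, True)] - base
    total = scores[(True, True)] - base
    return {
        "total": total,
        "algebra_only": algebra_only,
        "aegis_only": aegis_only,
        # Whatever the two single-component gains do not explain.
        "interaction": total - algebra_only - aegis_only,
    }
```

With the same base model and baseline implementation in every cell, the `(False, False)` cell should match the published baseline, and a `total` near zero would count against the paper's attribution.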

Original abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. Harness primitives are assembled via a substitution algebra; AEGIS provides a trace-driven multi-agent evolution engine that mirrors symbolic adaptation with reinforcement learning; execution trajectories are used both to update harnesses and to supply model training signals. Across ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified the system is reported to deliver an average +14.5% gain (maximum +44.0%), with larger improvements where baselines are weakest.

Significance. If the performance deltas can be shown to arise specifically from the substitution algebra and AEGIS evolution engine, the work would be significant: it supplies a concrete, complementary lever—runtime harness composition and trace-driven evolution—for agent improvement that does not rely solely on model scaling.

major comments (1)
  1. [Evaluation] Evaluation section: the manuscript reports aggregate gains of +14.5% (up to +44.0%) but supplies no model versions, baseline prompt/tool/memory implementations, hyperparameter search protocol, statistical tests, or ablations that remove either the composable assembly or the AEGIS loop. Because the central claim attributes these deltas specifically to the proposed mechanisms, and because gains are largest where baselines are weakest, the absence of these controls is load-bearing for the empirical conclusion.
minor comments (1)
  1. The statement that the complete codebase will be open-sourced only in a future release should be accompanied by a reproducibility checklist or at minimum a detailed description of baseline re-implementations sufficient for independent verification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation. We agree that additional controls and details are necessary to substantiate the attribution of gains to the substitution algebra and AEGIS engine, and we will revise the manuscript to address this.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the manuscript reports aggregate gains of +14.5% (up to +44.0%) but supplies no model versions, baseline prompt/tool/memory implementations, hyperparameter search protocol, statistical tests, or ablations that remove either the composable assembly or the AEGIS loop. Because the central claim attributes these deltas specifically to the proposed mechanisms, and because gains are largest where baselines are weakest, the absence of these controls is load-bearing for the empirical conclusion.

    Authors: We acknowledge the validity of this concern. The current manuscript does not include the requested details on model versions, baseline implementations, hyperparameter protocols, statistical tests, or component ablations. To address this, the revised manuscript will provide explicit specifications for all models and baselines used, describe the hyperparameter search process, report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals), and include ablations that systematically remove the substitution algebra and the AEGIS loop to isolate their contributions. These additions will strengthen the evidence that the observed gains arise specifically from the proposed mechanisms.

    Revision: yes
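
For reference, a paired bootstrap confidence interval of the kind the rebuttal mentions can be computed with the standard library alone. This is a generic percentile bootstrap over per-task score differences, not the authors' evaluation code. The function name and the default settings are assumptions.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for treated - baseline.

    Both sequences must hold per-task scores for the same tasks in the same order.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper
```

An interval that excludes zero on each benchmark would support a real difference between harnesses. It would not by itself show which component produced the difference, which is what the ablations are for.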

Circularity Check

0 steps flagged

No derivation chain or equations; empirical results not circular by construction

Full rationale

The paper reports empirical benchmark gains from HarnessX without presenting any mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-referential definitions. Claims rest on observed performance deltas across ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified rather than any algebraic or definitional reduction. Absence of equations or load-bearing self-citations in a formal chain means no steps qualify under the enumerated circularity patterns; concerns about ablations or baseline details pertain to experimental rigor, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; review is limited to summary information only.

pith-pipeline@v0.9.1-grok · 5787 in / 1106 out tokens · 27711 ms · 2026-07-03T23:41:50.966098+00:00 · methodology


    Trust the admissible list: if`take X from Y`does not appear, you are not at Y, Y is closed, or X is not visible --`go to`,`open`, or move on rather than guessing. ## Common Mistakes to Avoid - Revisiting searched locations without new evidence. - Ignoring visible objects -- if the target appears, take it immediately. - Skipping the state change -- do not ...