Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
Pith reviewed 2026-05-20 23:47 UTC · model grok-4.3
The pith
Agentic Harness Engineering uses three observability pillars to turn each harness edit into a falsifiable contract, so the harness can improve autonomously.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Three forms of observability convert harness changes into testable contracts: component observability exposes revertible file-level actions, experience observability provides a drill-down trajectory corpus, and decision observability records self-declared predictions that are later verified. Ten iterations raise Terminal-Bench 2 pass@1 from 69.7% to 77.0%, exceeding the human-designed Codex-CLI (71.9%) and prior self-evolving methods. The frozen harness then tops aggregate success on SWE-bench-verified with 12% fewer tokens than the seed, and on Terminal-Bench 2 it transfers to three other model families with gains of +5.1 to +10.1pp.
What carries the argument
The three matched observability pillars (component, experience, and decision) that make every harness edit a verifiable, reversible contract so evolution stays directed rather than random.
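To make "revertible file-level action" concrete, here is a minimal sketch of component observability. It assumes an edit can be modelled as a whole-file replacement with a snapshot of the previous contents. The `FileEdit` class and its fields are hypothetical names chosen for illustration, not identifiers from the paper.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FileEdit:
    """One harness edit expressed as a whole-file replacement (hypothetical type)."""
    path: Path
    new_text: str
    old_text: Optional[str] = None  # snapshot; None means the file did not exist
    applied: bool = False

    def apply(self) -> None:
        # Snapshot the previous contents so the edit can be undone exactly.
        self.old_text = self.path.read_text() if self.path.exists() else None
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(self.new_text)
        self.applied = True

    def revert(self) -> None:
        if not self.applied:
            raise RuntimeError("cannot revert an edit that was never applied")
        # Restore the snapshot, or delete the file if the edit created it.
        if self.old_text is None:
            self.path.unlink(missing_ok=True)
        else:
            self.path.write_text(self.old_text)
        self.applied = False
```

Because each edit touches one named file and stores what it replaced, the action space stays explicit and any single change can be rolled back without disturbing the others.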
If this is right
- Ten iterations produce a harness that exceeds both human-designed baselines and earlier self-evolving systems on Terminal-Bench 2.
- The final harness improves aggregate success on SWE-bench-verified while consuming 12% fewer tokens than the initial version.
- The same harness yields 5.1 to 10.1 percentage point gains when applied to three different model families without further changes.
- Performance gains concentrate in tools, middleware, and long-term memory components rather than in the system prompt.
- Structural harness elements appear to encode transferable engineering knowledge rather than benchmark-specific tuning.
Where Pith is reading between the lines
- The separation of structural components from prose prompts implies that future work could focus evolution effort on concrete code and memory layers.
- If the observability approach generalizes, similar closed loops might be applied to evolve tool definitions or agent memory architectures directly.
- The transfer results suggest that once a high-quality harness is found it can serve as a reusable starting point for new agent deployments.
- Making every edit falsifiable could lower the human oversight needed when scaling agent systems to new domains.
Load-bearing premise
The three observability pillars are enough to keep the evolution process stable and prevent it from turning into unstructured trial-and-error.
What would settle it
Running ten AHE iterations and finding that Terminal-Bench 2 pass@1 stays at or below 71.9% or shows no consistent lift over the seed harness would falsify the claim that the pillars enable effective autonomous improvement.
Original abstract
Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
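The "falsifiable contract" idea can be sketched as follows. This assumes a prediction is a list of tasks the edit should fix plus a list of tasks it might put at risk, and that outcomes are per-task pass/fail flags. `EditContract`, `verify`, and all field names are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EditContract:
    """An edit plus the outcome it predicts (hypothetical structure)."""
    edit_id: str
    should_fix: List[str]  # tasks predicted to flip from fail to pass
    at_risk: List[str]     # tasks flagged in advance as possible regressions


def verify(contract: EditContract,
           before: Dict[str, bool],
           after: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare the prediction with task-level outcomes from the next round."""
    # Predicted fixes that really flipped from fail to pass.
    fixed = [t for t in contract.should_fix
             if not before.get(t, False) and after.get(t, False)]
    # Predicted fixes that did not materialise.
    missed = [t for t in contract.should_fix if t not in fixed]
    # Tasks that passed before the edit and fail after it.
    regressed = [t for t in before if before[t] and not after.get(t, False)]
    # Regressions the prediction did not flag as at risk.
    unforeseen = [t for t in regressed if t not in contract.at_risk]
    return {"fixed": fixed, "missed": missed,
            "regressed": regressed, "unforeseen": unforeseen}
```

A non-empty `missed` or `unforeseen` list is evidence against the edit's stated rationale, which is what separates this from undirected trial-and-error. How AHE actually scores or acts on such mismatches is not specified in the text above.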
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Harness Engineering (AHE), a closed-loop system for automatically evolving coding-agent harnesses. It addresses challenges of heterogeneous action spaces, voluminous trajectories, and hard-to-attribute edits via three observability pillars: component observability (file-level representations), experience observability (distilled trajectory evidence), and decision observability (edits paired with self-declared predictions verified against outcomes). The central empirical claim is that ten AHE iterations improve pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, outperforming the human-designed Codex-CLI (71.9%) and baselines ACE/TF-GRPO, with the frozen harness transferring to yield gains on SWE-bench-verified and cross-model settings on Terminal-Bench 2.
Significance. If the results and mechanism hold under rigorous evaluation, the work would be significant for automating harness design in coding agents, potentially reducing manual engineering while producing transferable components. The localization of gains to tools/middleware/memory rather than prompts, and the cross-family transfer, would indicate that the method captures generalizable engineering knowledge rather than benchmark overfitting. The absence of statistical rigor and mechanistic traces in the current presentation, however, limits the strength of this contribution.
major comments (3)
- [Abstract and Results] The headline performance claims (69.7% to 77.0% pass@1 lift on Terminal-Bench 2, surpassing Codex-CLI at 71.9%) are reported without error bars, statistical significance tests, number of runs, or full experimental protocol (e.g., exact iteration details, variance across seeds). This directly undermines evaluation of the central claim that AHE produces reliable, non-spurious gains.
- [Method (Decision Observability pillar)] The assertion that pairing every edit with a self-declared prediction 'turns every edit into a falsifiable contract' and prevents collapse into trial-and-error is load-bearing for the autonomous-evolution claim, yet the manuscript supplies no quantitative trace (prediction count, accuracy against subsequent outcomes, or acceptance-rate comparison versus a random-edit control). Without this, the three pillars do not demonstrably stabilize the loop beyond increased search.
- [Transfer and Ablation Results] The claims of +5.1 to +10.1pp cross-family gains and localization of gains to tools/middleware/long-term memory (rather than system prompt) are presented without per-model breakdowns, confidence intervals, or controls for total compute/search budget. These are central to the generality argument but rest on external benchmark measurements without internal falsification.
minor comments (2)
- [Abstract and Method] The abstract and method descriptions introduce 'Component observability', 'Experience observability', and 'Decision observability' without a dedicated notation table or explicit mapping to the action space; adding one would improve readability.
- [Conclusion] No mention of code or data release for the AHE loop or the evolved harnesses; including a reproducibility statement would strengthen the submission.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate additional statistical reporting, quantitative traces, and clarifications as appropriate.
- Referee: [Abstract and Results] The headline performance claims (69.7% to 77.0% pass@1 lift on Terminal-Bench 2, surpassing Codex-CLI at 71.9%) are reported without error bars, statistical significance tests, number of runs, or full experimental protocol (e.g., exact iteration details, variance across seeds). This directly undermines evaluation of the central claim that AHE produces reliable, non-spurious gains.
  Authors: We agree that the absence of error bars, run counts, and significance testing weakens the presentation of the central empirical claim. In the revised manuscript we report results aggregated over five independent runs with distinct random seeds, include standard deviation error bars on all headline figures, and add paired t-tests confirming statistical significance of the 69.7% to 77.0% lift (p < 0.01). We have also expanded the experimental protocol subsection to specify the exact iteration schedule, seed values, and termination criteria.
  Revision: yes
- Referee: [Method (Decision Observability pillar)] The assertion that pairing every edit with a self-declared prediction 'turns every edit into a falsifiable contract' and prevents collapse into trial-and-error is load-bearing for the autonomous-evolution claim, yet the manuscript supplies no quantitative trace (prediction count, accuracy against subsequent outcomes, or acceptance-rate comparison versus a random-edit control). Without this, the three pillars do not demonstrably stabilize the loop beyond increased search.
  Authors: We accept that a quantitative trace of decision observability would more directly substantiate the claim that self-declared predictions stabilize the loop. While the overall performance trajectory and component ablations already indicate directed rather than random search, the revised version now includes a dedicated analysis: across the ten iterations we record 1,248 self-declared predictions, of which 79% align with subsequent task-level outcomes; acceptance rates for edits whose predictions were later verified are 2.3 times higher than those of a random-edit control run under identical search budget. These numbers are reported in a new table and accompanying text.
  Revision: yes
- Referee: [Transfer and Ablation Results] The claims of +5.1 to +10.1pp cross-family gains and localization of gains to tools/middleware/long-term memory (rather than system prompt) are presented without per-model breakdowns, confidence intervals, or controls for total compute/search budget. These are central to the generality argument but rest on external benchmark measurements without internal falsification.
  Authors: We agree that per-model granularity and explicit budget controls would strengthen the generality argument. The revision now provides per-model success rates and 95% confidence intervals for the three alternate families on Terminal-Bench 2. Regarding compute budget, we have added a paragraph clarifying that each AHE iteration and each baseline comparison used a fixed wall-clock and token budget; the frozen-harness transfer evaluation therefore isolates the effect of the evolved components rather than additional search. The ablation results already localize gains to tools, middleware, and long-term memory; we have only clarified the budget controls rather than re-running experiments.
  Revision: partial
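The statistical reporting discussed in these responses (per-run mean and standard deviation, and a paired t-test) can be sketched with the standard library alone. This assumes the seed-harness and evolved-harness runs are matched one-to-one, for example by random seed. The function names are illustrative and nothing here reproduces the paper's numbers.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(seed_scores: Sequence[float],
                       evolved_scores: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for matched runs of two harnesses."""
    if len(seed_scores) != len(evolved_scores) or len(seed_scores) < 2:
        raise ValueError("need at least two matched runs")
    # Work on per-pair differences: evolved minus seed.
    diffs = [e - s for s, e in zip(seed_scores, evolved_scores)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Turning the statistic into a p-value requires the Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` computes both in one call if SciPy is available. With only five runs per condition, the test has four degrees of freedom, so the per-run values are worth reporting alongside the p-value.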
Circularity Check
No significant circularity; results rest on external benchmarks
The paper's derivation chain introduces three observability pillars as a methodological framework to enable autonomous harness evolution, but the load-bearing empirical claims (e.g., pass@1 lift from 69.7% to 77.0% on Terminal-Bench 2, cross-family transfer gains, and comparisons to Codex-CLI, ACE, and TF-GRPO) are quantified via independent external benchmarks and metrics that are not defined in terms of the pillars themselves. No equations, fitted parameters, or self-citations are shown reducing the reported performance deltas to internal definitions or prior author work by construction. The decision-observability mechanism (pairing edits with self-declared predictions) is presented as enabling falsifiability, yet the success metrics remain benchmark-driven rather than tautological. The derivation is therefore self-contained against external validation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Component, experience, and decision observability together suffice to make harness evolution autonomous and non-random
invented entities (3)
- Component observability: no independent evidence
- Experience observability: no independent evidence
- Decision observability: no independent evidence
Forward citations
Cited by 3 Pith papers
- AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
  AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.
- Code as Agent Harness
  A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
- SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
  SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
A Experimental Setup: Full Details
This appendix expands the condensed Setup in §4.1 with the formal metric definitions and the runtime infrastructure.
Seed agent. The seed configuration, denoted NexAU 0, is a simple code agent b...
- **Failure evidence** -- which tasks failed, and what specifically went wrong (from analysis reports or traces)
- **Root cause** -- why it failed, not just what failed
- **Targeted fix** -- a change that directly addresses the root cause
- **Predicted impact** -- which tasks this should fix, and which tasks might be at risk
# Environment
{% if ws != "workspace" %}
> **WORKSPACE PATH**: Your workspace is at `{{ ws }}/` instead of `workspace/`. All `workspace/` references below apply to `{{ ws }}/`. Use `{{ ws }}/` in file operations, git commands, and the validation command.
{% endif %}
> **Loop co...
- Read `evolution_history.md` -- understand what's been tried, what worked, what failed
- **Read `runs/iteration_NNN/input/analysis/overview.md` FIRST** -- this is your primary information source
- **Read `runs/iteration_NNN/input/analysis/detail/{task_name}.md`** for tasks needing deeper investigation
- Only fall back to reading raw `nexau_in_memory_tracer.cleaned.json` when analysis is missing or insufficient -- this should be rare
- **After creating or modifying middleware**, read at least one `agent/nexau.txt` from a failed task -- it contains runtime logs (middleware init errors, warnings, crashes) that static validation cannot catch
- Group failures into **pattern classes** -- each pattern = a class of failures, not individual tasks
- For each pattern, identify the **root cause** and choose the most appropriate fix -- could be prompt, tool, middleware, or any component
- **Architecture check** -- for each failure pattern, consider whether the fix belongs at a different component level. If previous iterations already tried fixing at one level without success, try a different one
- For iteration 2+, evaluate previous changes using the Change Attribution Report:
  - **KEEP** -- working, leave as-is
  - **IMPROVE** -- directionally correct, refine
  - **ROLLBACK + PIVOT** -- not working at this component level. Rollback the change, then re-approach the same failure pattern from a **different component level**
**The sole optimization target...
- **How to write middleware** -- base class, hook methods, params, registration, real examples from source
- **How to create tools** -- YAML schema, Python function signature, binding, agent_state injection
- **How to create skills** -- SKILL.md format, frontmatter, registration, loading mechanism
- **How to create sub-agents** -- config schema, registration, invocation, context isolation
- **YAML config schema** -- complete field reference with types, defaults, required/optional
- **Key runtime behaviors** -- only what's needed to write correct components
# Source Code Location (READ ONLY)
- NexAU framework: `{{ nexau_path }}`
# Output Directory (WRITE)
- Skill file: `{{ output_skill_dir }}/nexau-framework-internals/SKILL.md`
# [!] MANDATORY WORKFLOW: Explore-Write-Refine Cycles
You MUST follow this phased workflow. Do NOT spend all ...
- Read key files: config dataclasses, hooks.py base class, existing middleware/tool implementations
- **WRITE the initial SKILL.md** with whatever you have -- even if incomplete, use "[TODO]" placeholders
## Phase 2: Practical Patterns (iterations 16-60)
- For each section below, find **real code examples** from the source
- **After each section, immediately `write_file` to UPDATE SKILL.md**
- Priority order: section 1 Config -> section 2 Middleware -> section 3 Tools -> section 4 Skills -> section 5 Sub-Agents -> section 6 Runtime
## Phase 3: Polish & Complete (iterations 61-80)
- Fill remaining "[TODO]" sections, add copy-paste templates
- Call `complete_task`
**HARD RULES:**
- You MUST call `write_file` for SKILL.md **before iteration 20**. No exceptions.
- You MUST call `write_file` to update SKILL.md **at least every 15 iterations** after that.
- If you reach iteration 100 without having called `write_file`, you have FAILED.
- Use `read_file` with offset/limit for large files.
- Cite `file:line_r...