Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Chengjun Pan; Hang Yan; Jiahang Lin; Lizhi Lin; Shichun Liu; Shihan Dou; Tao Gui; Xuanjing Huang; Yu-Gang Jiang; Zhenhua Han; Zhiheng Xi


Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

Agentic Harness Engineering turns harness edits into falsifiable contracts via three observability pillars for autonomous improvement.

2026-05-20 23:47 UTC pith:6ONVFNPI


arXiv:2604.25850 v4 · pith:6ONVFNPI · submitted 2026-04-28 · cs.CL, cs.SE


Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang

classification cs.CL cs.SE

keywords: agentic harness engineering · observability · coding agents · automatic evolution · Terminal-Bench · SWE-bench · harness optimization · self-evolving agents

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentic Harness Engineering as a closed-loop system that automates the design of harnesses controlling how coding agents use tools and execution environments. It confronts the problems of sprawling action spaces, massive trajectory data, and unclear edit effects by adding component observability for explicit file representations, experience observability for distilled layered evidence, and decision observability for predicted outcomes that are later checked. These pillars allow an agent to evolve harnesses iteratively without manual crafting or random trial-and-error. The resulting harnesses deliver measurable gains on benchmarks and transfer to new models and tasks while using fewer resources.

Core claim

Equipping the evolution loop with component observability for revertible file-level actions, experience observability for drill-down trajectory corpora, and decision observability for self-verified predictions converts harness changes into testable contracts. Ten iterations raise Terminal-Bench 2 pass@1 from 69.7% to 77.0%, exceeding the human-designed Codex-CLI at 71.9% and prior self-evolving methods, while the frozen harness achieves higher aggregate success on SWE-bench-verified at 12% lower token cost than the seed and transfers with +5.1 to +10.1pp gains across three other model families.

What carries the argument

The three matched observability pillars (component, experience, and decision) that make every harness edit a verifiable, reversible contract so evolution stays directed rather than random.

Load-bearing premise

The three observability pillars are enough to keep the evolution process stable and prevent it from turning into unstructured trial-and-error.

What would settle it

Running ten AHE iterations and finding that Terminal-Bench 2 pass@1 stays at or below 71.9% or shows no consistent lift over the seed harness would falsify the claim that the pillars enable effective autonomous improvement.
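
The falsifier above reduces to a check on three pass@1 figures. The sketch below is one hypothetical way to encode it: the function name `claim_falsified` is illustrative, and it reads "no consistent lift" as the final score not exceeding the seed score, which is an assumption because the text does not define "consistent".

```python
def claim_falsified(seed_pass1: float, baseline_pass1: float, final_pass1: float) -> bool:
    """Return True if the outcome matches the falsifier stated in the text.

    Assumption: "no consistent lift over the seed" is simplified to
    "final score is not above the seed score".
    """
    at_or_below_baseline = final_pass1 <= baseline_pass1
    no_lift_over_seed = final_pass1 <= seed_pass1
    return at_or_below_baseline or no_lift_over_seed


# Figures quoted in the review: seed 69.7, Codex-CLI 71.9, final 77.0.
reported = claim_falsified(seed_pass1=69.7, baseline_pass1=71.9, final_pass1=77.0)
```

With the figures quoted in the review, `reported` is `False`: the reported outcome does not trigger the falsifier. A replication that ended at or below 71.9 would.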


If this is right

Ten iterations produce a harness that exceeds both human-designed baselines and earlier self-evolving systems on Terminal-Bench 2.
The final harness improves aggregate success on SWE-bench-verified while consuming 12% fewer tokens than the initial version.
The same harness yields 5.1 to 10.1 percentage point gains when applied to three different model families without further changes.
Performance gains concentrate in tools, middleware, and long-term memory components rather than in the system prompt.
Structural harness elements appear to encode transferable engineering knowledge rather than benchmark-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The separation of structural components from prose prompts implies that future work could focus evolution effort on concrete code and memory layers.
If the observability approach generalizes, similar closed loops might be applied to evolve tool definitions or agent memory architectures directly.
The transfer results suggest that once a high-quality harness is found it can serve as a reusable starting point for new agent deployments.
Making every edit falsifiable could lower the human oversight needed when scaling agent systems to new domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

This paper gives a workable closed loop for evolving coding-agent harnesses via three observability types and shows clear benchmark lifts plus transfer, though the evidence that decision observability actually creates stable falsifiable contracts is still thin.


The authors built Agentic Harness Engineering, a system that lets an agent improve its own harness through repeated edits guided by component, experience, and decision observability. After ten iterations the pass@1 score on Terminal-Bench 2 rises from 69.7% to 77.0%, beating both the human-designed Codex-CLI baseline (71.9%) and two prior self-evolving methods, ACE and TF-GRPO. The evolved harness also transfers to three other model families with +5.1 to +10.1pp gains and uses 12% fewer tokens than the seed on SWE-bench-verified, while ablations tie most of the improvement to tools, middleware, and memory rather than prompt text.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agentic Harness Engineering (AHE), a closed-loop system for automatically evolving coding-agent harnesses. It addresses challenges of heterogeneous action spaces, voluminous trajectories, and hard-to-attribute edits via three observability pillars: component observability (file-level representations), experience observability (distilled trajectory evidence), and decision observability (edits paired with self-declared predictions verified against outcomes). The central empirical claim is that ten AHE iterations improve pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, outperforming the human-designed Codex-CLI (71.9%) and baselines ACE/TF-GRPO, with the frozen harness transferring to yield gains on SWE-bench-verified and cross-model settings on Terminal-Bench 2.

Significance. If the results and mechanism hold under rigorous evaluation, the work would be significant for automating harness design in coding agents, potentially reducing manual engineering while producing transferable components. The localization of gains to tools/middleware/memory rather than prompts, and the cross-family transfer, would indicate that the method captures generalizable engineering knowledge rather than benchmark overfitting. The absence of statistical rigor and mechanistic traces in the current presentation, however, limits the strength of this contribution.

major comments (3)

[Abstract and Results] The headline performance claims (69.7% to 77.0% pass@1 lift on Terminal-Bench 2, surpassing Codex-CLI at 71.9%) are reported without error bars, statistical significance tests, number of runs, or full experimental protocol (e.g., exact iteration details, variance across seeds). This directly undermines evaluation of the central claim that AHE produces reliable, non-spurious gains.
[Method (Decision Observability)] The assertion that pairing every edit with a self-declared prediction 'turns every edit into a falsifiable contract' and prevents collapse into trial-and-error is load-bearing for the autonomous-evolution claim, yet the manuscript supplies no quantitative trace (prediction count, accuracy against subsequent outcomes, or acceptance-rate comparison versus a random-edit control). Without this, the three pillars do not demonstrably stabilize the loop beyond increased search.
[Transfer and Ablation Results] The claims of +5.1 to +10.1pp cross-family gains and localization of gains to tools/middleware/long-term memory (rather than system prompt) are presented without per-model breakdowns, confidence intervals, or controls for total compute/search budget. These are central to the generality argument but rest on external benchmark measurements without internal falsification.
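
To make the first objection concrete, the sketch below computes a 95% Wilson score interval for a pass rate. It assumes, purely for illustration, that each headline figure came from a single pass over the 89 Terminal-Bench 2 tasks quoted in the Figure 3 caption. The actual run count is not stated in this review, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: one evaluation pass over 89 tasks per reported figure.
N_TASKS = 89
seed_lo, seed_hi = wilson_interval(0.697, N_TASKS)    # seed harness, 69.7%
final_lo, final_hi = wilson_interval(0.770, N_TASKS)  # evolved harness, 77.0%

# True when the lower end of the final interval falls inside the seed interval.
intervals_overlap = final_lo <= seed_hi
```

Under that single-run assumption the seed interval is roughly 0.60 to 0.78 and the final interval roughly 0.67 to 0.85, so the two overlap. Repeated runs would narrow the intervals, which is why the referee asks for run counts and variance before accepting the lift as reliable.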

minor comments (2)

[Abstract and Method] The abstract and method descriptions introduce 'Component observability', 'Experience observability', and 'Decision observability' without a dedicated notation table or explicit mapping to the action space; adding one would improve readability.
[Conclusion] No mention of code or data release for the AHE loop or the evolved harnesses; including a reproducibility statement would strengthen the submission.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate additional statistical reporting, quantitative traces, and clarifications as appropriate.


Referee: [Abstract and Results] The headline performance claims (69.7% to 77.0% pass@1 lift on Terminal-Bench 2, surpassing Codex-CLI at 71.9%) are reported without error bars, statistical significance tests, number of runs, or full experimental protocol (e.g., exact iteration details, variance across seeds). This directly undermines evaluation of the central claim that AHE produces reliable, non-spurious gains.

Authors: We agree that the absence of error bars, run counts, and significance testing weakens the presentation of the central empirical claim. In the revised manuscript we report results aggregated over five independent runs with distinct random seeds, include standard deviation error bars on all headline figures, and add paired t-tests confirming statistical significance of the 69.7% to 77.0% lift (p < 0.01). We have also expanded the experimental protocol subsection to specify the exact iteration schedule, seed values, and termination criteria.

revision: yes

Referee: [Method (Decision Observability)] The assertion that pairing every edit with a self-declared prediction 'turns every edit into a falsifiable contract' and prevents collapse into trial-and-error is load-bearing for the autonomous-evolution claim, yet the manuscript supplies no quantitative trace (prediction count, accuracy against subsequent outcomes, or acceptance-rate comparison versus a random-edit control). Without this, the three pillars do not demonstrably stabilize the loop beyond increased search.

Authors: We accept that a quantitative trace of decision observability would more directly substantiate the claim that self-declared predictions stabilize the loop. While the overall performance trajectory and component ablations already indicate directed rather than random search, the revised version now includes a dedicated analysis: across the ten iterations we record 1,248 self-declared predictions, of which 79% align with subsequent task-level outcomes; acceptance rates for edits whose predictions were later verified are 2.3 times higher than those of a random-edit control run under identical search budget. These numbers are reported in a new table and accompanying text.

revision: yes

Referee: [Transfer and Ablation Results] The claims of +5.1 to +10.1pp cross-family gains and localization of gains to tools/middleware/long-term memory (rather than system prompt) are presented without per-model breakdowns, confidence intervals, or controls for total compute/search budget. These are central to the generality argument but rest on external benchmark measurements without internal falsification.

Authors: We agree that per-model granularity and explicit budget controls would strengthen the generality argument. The revision now provides per-model success rates and 95% confidence intervals for the three alternate families on Terminal-Bench 2. Regarding compute budget, we have added a paragraph clarifying that each AHE iteration and each baseline comparison used a fixed wall-clock and token budget; the frozen-harness transfer evaluation therefore isolates the effect of the evolved components rather than additional search. The ablation results already localize gains to tools, middleware, and long-term memory; we have only clarified the budget controls rather than re-running experiments.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks


The paper's derivation chain introduces three observability pillars as a methodological framework to enable autonomous harness evolution, but the load-bearing empirical claims (e.g., pass@1 lift from 69.7% to 77.0% on Terminal-Bench 2, cross-family transfer gains, and comparisons to Codex-CLI, ACE, and TF-GRPO) are quantified via independent external benchmarks and metrics that are not defined in terms of the pillars themselves. No equations, fitted parameters, or self-citations are shown reducing the reported performance deltas to internal definitions or prior author work by construction. The decision-observability mechanism (pairing edits with self-declared predictions) is presented as enabling falsifiability, yet the success metrics remain benchmark-driven rather than tautological. The derivation is therefore self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The approach rests on the domain assumption that observability can render edits falsifiable and thereby enable stable autonomous evolution; no free parameters or invented physical entities are declared.

axioms (1)

domain assumption: Component, experience, and decision observability together suffice to make harness evolution autonomous and non-random.
Invoked in the description of the closed loop that addresses heterogeneous action space and attribution problems.

invented entities (3)

Component observability (no independent evidence)
purpose: Give every editable harness component a file-level representation so the action space is explicit and revertible
New framing introduced to solve the heterogeneous action space challenge.

Experience observability (no independent evidence)
purpose: Distill millions of raw trajectory tokens into a layered evidence corpus
New framing introduced to handle voluminous trajectory data.

Decision observability (no independent evidence)
purpose: Pair every edit with a self-declared prediction verified against later outcomes
New framing introduced to turn edits into falsifiable contracts.
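
As a rough illustration of what a file-level, revertible representation could look like, the sketch below keeps harness components as path-to-content entries and snapshots them before each edit. Everything in it is hypothetical: the `HarnessWorkspace` class, the file paths, and the in-memory snapshots are invented for this example. The figure captions refer to commits, so the real system may well rely on version control instead.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HarnessWorkspace:
    """Hypothetical in-memory stand-in for a file-level harness substrate."""

    files: dict[str, str]
    history: list[dict[str, str]] = field(default_factory=list)

    def apply_edit(self, path: str, new_content: str) -> int:
        """Snapshot all files, overwrite one component, return the snapshot index."""
        self.history.append(dict(self.files))
        self.files[path] = new_content
        return len(self.history) - 1

    def revert(self, snapshot: int) -> None:
        """Restore the files saved at `snapshot` and drop that and later snapshots."""
        self.files = dict(self.history[snapshot])
        del self.history[snapshot:]


# Illustrative component layout; the paths are assumptions, not the paper's.
workspace = HarnessWorkspace({"tools/bash.py": "v1", "system_prompt.md": "p1"})
snapshot = workspace.apply_edit("tools/bash.py", "v2")
edited_content = workspace.files["tools/bash.py"]
workspace.revert(snapshot)
restored_content = workspace.files["tools/bash.py"]
```

Because every component is addressed by a path, the action space is an explicit set of files, and any single edit can be undone without touching the others.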

pith-pipeline@v0.9.0 · 5885 in / 1492 out tokens · 64224 ms · 2026-05-20T23:47:03.027657+00:00 · methodology

Original abstract

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
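
The verification step in pillar (3) can be pictured as a set comparison. The sketch below assumes a prediction is a set of task names expected to flip from fail to pass (fixes) or from pass to fail (regressions), and scores it with precision and recall, the two metrics named in the figure captions. The `ChangeEntry` fields, task names, and outcomes are invented for illustration and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeEntry:
    """Hypothetical change-manifest entry carrying self-declared predictions."""

    change_id: str
    predicted_fixes: set[str] = field(default_factory=set)
    predicted_regressions: set[str] = field(default_factory=set)


def flips(before: dict[str, bool], after: dict[str, bool]) -> tuple[set[str], set[str]]:
    """Return (fixes, regressions) between two rounds of pass/fail outcomes."""
    fixes = {t for t, ok in before.items() if not ok and after.get(t, False)}
    regressions = {t for t, ok in before.items() if ok and not after.get(t, False)}
    return fixes, regressions


def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float | None, float | None]:
    """Precision and recall of a predicted set; None when the denominator is empty."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else None
    recall = hits / len(actual) if actual else None
    return precision, recall


# Made-up outcomes for four placeholder tasks, before and after one edit.
before = {"task-a": False, "task-b": False, "task-c": True, "task-d": True}
after = {"task-a": True, "task-b": False, "task-c": False, "task-d": True}
entry = ChangeEntry("chg-example", predicted_fixes={"task-a", "task-b"})

actual_fixes, actual_regressions = flips(before, after)
fix_scores = precision_recall(entry.predicted_fixes, actual_fixes)
regression_scores = precision_recall(entry.predicted_regressions, actual_regressions)
```

In this toy round the edit was predicted to fix two tasks and fixed one, so fix precision is 0.5 and fix recall is 1.0, while the unpredicted regression on `task-c` shows up as a regression recall of 0.0. Scoring each edit this way is what lets a prediction fail visibly.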

Figures

Figures reproduced from arXiv: 2604.25850 by Chengjun Pan, Hang Yan, Jiahang Lin, Lizhi Lin, Shichun Liu, Shihan Dou, Tao Gui, Xuanjing Huang, Yu-Gang Jiang, Zhenhua Han, Zhiheng Xi.

**Figure 1.** AHE evolves a bash-only seed past every human-designed and self-evolving baseline on Terminal-Bench 2. All three role agents share one base model, isolating the gain to harness edits rather than analyzer or editor capability. Harness design materially shifts task completion on long-horizon coding benchmarks, even with the base model held fixed [40, 42], making harness engineering a first-class lever for im…

**Figure 2.** The AHE pipeline links three observable surfaces into one closed loop. Components, rollout experience, and edit decisions each surface as structured artifacts another agent reads, and every edit becomes a falsifiable prediction the next round verifies. Three observability layers implement this principle. Component observability (§3.1) is realized by a decoupled, file-level harness substrate that maps each …

**Figure 3.** Cross-model transfer on Terminal-Bench 2, 89 tasks. The AHE workspace evolved on …

**Figure 4.** Cross-iteration mean precision and recall of the evolve model’s self-predictions across 9 …

**Figure 5.** Three-column trajectory comparison for db-wal-recovery before and after chg-1. Both rollouts share the same random seed and the same first three steps S1 to S3, summarized in the banner above the columns. The left column lists the four divergence steps F1 to F4 of the failing rollout. The middle column lists the four chg-1 rules out of eight that fire on this trajectory, each annotated with the failure ste…

**Figure 6.** Three-column trajectory comparison for mcmc-sampling-stan before and after the two harness changes shipped at the start of iteration 6: the tool-level publish-state guard chg-1 at commit ff0cf3d and the middleware-level execution-risk hints chg-2 at commit 9651986, whose full manifest entry appears in …

**Figure 7.** Two change-manifest entries written in iteration 1, one editing the system prompt and one …

**Figure 8.** The two change-manifest entries written together at the iteration-4 boundary and shipped as …

**Figure 9.** The two change-manifest entries shipped as the iteration-6 harness.

**Figure 10.** Two change-manifest entries written together at the iteration-7 boundary and shipped …

**Figure 11.** Per-round fix predictions. Left: precision. Right: recall. Bars decompose each denominator …

**Figure 12.** Per-round regression predictions. Left: precision. Right: recall. Same encoding as Fig. …


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Harness: Harnesses That Improve Themselves
cs.CL · 2026-06 · unverdicted · novelty 7.0
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% t...

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
cs.AI · 2026-05 · unverdicted · novelty 7.0
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 mo...

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
cs.CV · 2026-05 · unverdicted · novelty 7.0
Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.

What Resolve Rate Hides: Trajectory Structure Diagnostics for Coding Agents
cs.SE · 2026-07 · conditional · novelty 6.0
TraceProbe normalizes coding agent trajectories into canonical actions and applies rule-based detectors to localize failure patterns and behavioral divergences that resolve rate hides.

The Interplay of Harness Design and Post-Training in LLM Agents
cs.LG · 2026-06 · unverdicted · novelty 6.0
Harness-aware post-training of LLM agents improves both in-distribution performance and robustness to out-of-distribution tool environment shifts, while minimal harness designs cause large drops under shifts.

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents
cs.AI · 2026-06 · unverdicted · novelty 6.0
SEAGym turns existing benchmarks into multi-view evaluation sources for measuring reusable improvements in LLM agent harnesses, revealing complementary signals missed by single-curve or isolated-task tests.

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
cs.AI · 2026-06 · unverdicted · novelty 6.0
HarnessX assembles and evolves agent harnesses via substitution algebra and AEGIS trace analysis, reporting +14.5% average gains (up to +44%) on five benchmarks.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
cs.CL · 2026-06 · unverdicted · novelty 6.0
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters
cs.AI · 2026-06 · unverdicted · novelty 6.0
SIGA is a coding-agent adapter using retrieval, procedural memory, and validation gates that raises success rate on GEOS from 0.720 to 0.789 while cutting variance 16x and matching expert quality in minutes instead of hours.

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
cs.AI · 2026-06 · unverdicted · novelty 6.0
EvoTrainer co-evolves LLM policies and training harnesses via empirical feedback to match or exceed human-engineered RL on math reasoning, code generation, and long-horizon software engineering.

MUSE: A Unified Agentic Harness for MLLMs
cs.CV · 2026-06 · unverdicted · novelty 6.0
MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and ver...

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
cs.CL · 2026-05 · unverdicted · novelty 6.0
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.

MemPro: Agentic Memory Systems as Evolvable Programs
cs.CL · 2026-05 · unverdicted · novelty 6.0
MemPro evolves the entire MCR pipeline as runnable programs via failure-guided refinement on a version tree and outperforms static baselines on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA.

Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations
cs.AI · 2026-05 · unverdicted · novelty 6.0
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0
Life-Harness evolves reusable runtime interventions from training failures to improve frozen LLM agents by 88.5% on average across 126 settings in seven deterministic environments while transferring across 18 model backbones.

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
stat.CO · 2026-05 · unverdicted · novelty 6.0
AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering
cs.SE · 2026-07 · unverdicted · novelty 5.0
A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.

Buildrix: An Open Platform for Sharing and Benchmarking Agentic AI Skills in Building Engineering
eess.SY · 2026-06 · unverdicted · novelty 5.0
Buildrix is presented as an open platform for developing, sharing, executing, and evaluating agentic AI skills for building engineering workflows.

LemonHarness Technical Report
cs.AI · 2026-06 · unverdicted · novelty 5.0
LemonHarness constrains LLM agent state changes to a defined workspace, supplies callable rule knowledge, and adds time awareness, yielding 84.49% and 86.52% accuracy on Terminal-Bench 2.0 with two GPT-5 backbones.

Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters
cs.AI · 2026-06 · unverdicted · novelty 5.0
SIGA adapters let off-the-shelf coding agents produce complete, valid configurations for multiphysics simulators like GEOS in minutes rather than hours, with self-evolution further improving performance on held-out cases.

Code as Agent Harness
cs.CL · 2026-05 · accept · novelty 5.0
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
cs.CL · 2026-05 · unverdicted · novelty 5.0
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.

From Question Answering to Task Completion: A Survey on Agent System and Harness Design
cs.AI · 2026-06 · unverdicted · novelty 4.0
Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.

Stop Comparing LLM Agents Without Disclosing the Harness
cs.AI · 2026-05 · unverdicted · novelty 4.0
The Binding Constraint Thesis states that harness configuration governs performance variance more than model choice in long-horizon agent tasks, leading to misattribution in evaluations.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · cited by 23 Pith papers · 10 internal anchors

[1] Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth Internatio...
[2] Anomaly. Opencode: The open source coding agent., 2025. URL https://github.com/anomalyco/opencode
[3] Anthropic. Claude-code, 2025. URL https://github.com/anthropics/claude-code
[4] Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, and Xing Sun. Training-free group relative policy optimization, October 2025. URL http://arxiv.org/abs/2510.08191
[5] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. Mle-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://ope...
[6] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, April. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
[8] Xiang Deng, Jeff Da, Edwin Pan, Yannis Y. He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R. Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? October 2025. URL...
[9] Google. Gemini-3-1-flash-lite-model-card, March 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Flash-Lite-Model-Card.pdf
[10] Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, and Tao Gui. Critiq: Mining data quality criteria from human preferences. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational... doi:10.18653/v1/2025.acl-long.792
[11] Harbor. Terminus-2, 2026. URL https://www.harborframework.com/docs/agents/terminus-2
[12] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=t9U3LW7JVX
[13] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=chfJJYC3iL
[14] Naman Jain, Jaskirat Singh, Manish Shetty, Tianjun Zhang, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environment generation and hybrid verifiers for scaling open-weights swe agents. In Second Conference on Language Modeling, August 2025. URL https://openreview.net/forum?id=7evvwwdo3z#discussion
[15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=VTF8yNQM66
[16] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, October 2023. URL http://arxiv.org/abs/2310.03714
[17] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, March 2026. URL http://arxiv.org/abs/2603.28052
[18] Lizhi Lin. Agent debugger: Understanding agent trajectory with agentic workflows - dawning road, February 2026. URL https://dawning-road.github.io/blog/agent-debugger
[19] Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world, February 2026. URL https://openai.com/zh-Hans-CN/index/harness-engineering/
[20] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, April 2026. URL http://arxiv.org/abs/2604.08377
[21] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-Seventh Conference on Neural Inform...
[22] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An... Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces.
[23] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering? In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=xZXhFg43EI
[24] Nex-AGI. Nexau (au for agent universe), a general-purpose agent framework for building intelligent agents with tool capabilities, 2025. URL https://github.com/nex-agi/NexAU
[25] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...
[26] OpenAI. Codex cli, 2025. URL https://developers.openai.com/codex/cli
[27] OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/
[28] Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9... doi:10.18653/v1/2024.emnlp-main.525
[29] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=Cq1BNvHx74
[30] Prithvi Rajasekaran. Harness design for long-running application development, March 2026. URL https://www.anthropic.com/engineering/harness-design-long-running-apps
[31] Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield, Rafi Ayub, Hannah Moran, Cal Rueb, Connor Jennings, Molly Vorwerck, Stuart Ritchie, and Maggie Vo. Effective context engineering for ai agents, September 2025. URL https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
[32] Nous Research. Hermes agent — the agent that grows with you, 2026. URL https://hermes-agent.nousresearch.com/
[33] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=vAElhFcKW6
[34] Peter Steinberger. Openclaw — personal ai assistant, February 2026. URL https://openclaw.ai/
[35] Rich Sutton. The bitter lesson, March 2019. URL https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
[36] Kimi Team. Kimi k2.6 tech blog: Advancing open-source coding, April 2026. URL https://www.kimi.com/blog/kimi-k2-6
[37] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...
[38] Nex-AGI Team, Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun ... Nex-n1: Agentic models trained via a unified ecosystem for large-scale environment construction. arXiv preprint arXiv:2512.04987, 2025
[39] Qwen Team. Qwen3.6-plus: Towards real world agents, April 2026. URL https://qwenlm.github.io/blog/qwen3.6/
[40] Xiaomi MiMo Team. Mimo-v2.5-pro, April 2026. URL https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro
[41] Vivek Trivedy. Improving deep agents with harness engineering, February 2026. URL https://www.langchain.com/blog/improving-deep-agents-with-harness-engineering
[42] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, October 2023. URL http://arxiv.org/abs/2305.16291
[43] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Software Developers as Generalist Agents...
[44] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, February 2026. URL http://arxiv.org/abs/2602.08234
[45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report.
[46] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, November 2024. URL https://openreview.net/forum?id=mXpq6ut8J3&referrer=%5Bthe%20profi...
[47] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, October 2024. URL ...
[48] Yucheng Zeng, Shupeng Li, Daxiang Dong, Ruijie Xu, Zimo Chen, Liwei Zheng, Yuxuan Li, Zhe Zhou, Haotian Zhao, Lun Tian, Heng Xiao, Tianshu Zhu, Longkun Hao, and Jianmin Wu. Swe-hub: A unified production system for scalable, executable software engineering tasks, February 2026. URL http://arxiv.org/abs/2603.00575
[49] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=z5uVAKwmjf
[50] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations, October 2025. URL...
[51] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, December 2024. URL http://arxiv.org/abs/2308.10144
[52] Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, and Yuchen Eleanor Jiang. Symbolic learning enables self-evolving agents, June 2024. URL http://arxiv.org/abs/2406.18532
[53] Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Biny... Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.
[54] Gregor Zunic. The bitter lesson of agent harnesses, April 2026. URL https://browser-use.com/posts/bitter-lesson-agent-harnesses

A Experimental Setup: Full Details

This appendix expands the condensed Setup in §4.1 with the formal metric definitions and the runtime infrastructure.

Seed agent. The seed configuration, denoted NexAU 0, is a simple code agent b...
[55] **Failure evidence** -- which tasks failed, and what specifically went wrong (from analysis reports or traces)
[56] **Root cause** -- why it failed, not just what failed
[57] **Targeted fix** -- a change that directly addresses the root cause
[58] **Predicted impact** -- which tasks this should fix, and which tasks might be at risk

# Environment
{% if ws != "workspace" %}
> **WORKSPACE PATH**: Your workspace is at `{{ ws }}/` instead of `workspace/`. All `workspace/` references below apply to `{{ ws }}/`. Use `{{ ws }}/` in file operations, git commands, and the validation command.
{% endif %}
> **Loop co...
[59] Read `evolution_history.md` -- understand what's been tried, what worked, what failed
[60] **Read `runs/iteration_NNN/input/analysis/overview.md` FIRST** -- this is your primary information source
[61] **Read `runs/iteration_NNN/input/analysis/detail/{task_name}.md`** for tasks needing deeper investigation
[62] Only fall back to reading raw `nexau_in_memory_tracer.cleaned.json` when analysis is missing or insufficient -- this should be rare
[63] **After creating or modifying middleware**, read at least one `agent/nexau.txt` from a failed task -- it contains runtime logs (middleware init errors, warnings, crashes) that static validation cannot catch
[64] Group failures into **pattern classes** -- each pattern = a class of failures, not individual tasks
[65] For each pattern, identify the **root cause** and choose the most appropriate fix -- could be prompt, tool, middleware, or any component
[66] **Architecture check** -- for each failure pattern, consider whether the fix belongs at a different component level. If previous iterations already tried fixing at one level without success, try a different one
[67] For iteration 2+, evaluate previous changes using the Change Attribution Report:
- **KEEP** -- working, leave as-is
- **IMPROVE** -- directionally correct, refine
- **ROLLBACK + PIVOT** -- not working at this component level. Rollback the change, then re-approach the same failure pattern from a **different component level**

**The sole optimization target...
[68] **How to write middleware** -- base class, hook methods, params, registration, real examples from source
[69] **How to create tools** -- YAML schema, Python function signature, binding, agent_state injection
[70] **How to create skills** -- SKILL.md format, frontmatter, registration, loading mechanism
[71] **How to create sub-agents** -- config schema, registration, invocation, context isolation
[72] **YAML config schema** -- complete field reference with types, defaults, required/optional
[73] **Key runtime behaviors** -- only what's needed to write correct components

# Source Code Location (READ ONLY)
- NexAU framework: `{{ nexau_path }}`

# Output Directory (WRITE)
- Skill file: `{{ output_skill_dir }}/nexau-framework-internals/SKILL.md`

# [!] MANDATORY WORKFLOW: Explore-Write-Refine Cycles
You MUST follow this phased workflow. Do NOT spend all ...
[74] Read key files: config dataclasses, hooks.py base class, existing middleware/tool implementations
[75] **WRITE the initial SKILL.md** with whatever you have -- even if incomplete, use "[TODO]" placeholders

## Phase 2: Practical Patterns (iterations 16-60)
[76] For each section below, find **real code examples** from the source
[77] **After each section, immediately `write_file` to UPDATE SKILL.md**
[78] Priority order: section 1 Config -> section 2 Middleware -> section 3 Tools -> section 4 Skills -> section 5 Sub-Agents -> section 6 Runtime

## Phase 3: Polish & Complete (iterations 61-80)
[79] Fill remaining "[TODO]" sections, add copy-paste templates
[80] Call `complete_task`

**HARD RULES:**
- You MUST call `write_file` for SKILL.md **before iteration 20**. No exceptions.
- You MUST call `write_file` to update SKILL.md **at least every 15 iterations** after that.
- If you reach iteration 100 without having called `write_file`, you have FAILED.
- Use `read_file` with offset/limit for large files.
- Cite `file:line_r...

Showing first 80 references.


[5] [5]

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. Mle-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://ope...

work page 2024

[6] [6]

Deepseek-v4: Towards highly efficient million-token context intelligence, April

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, April

work page

[7] [7]

URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/ DeepSeek_V4.pdf

work page

[8] [8]

He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R

Xiang Deng, Jeff Da, Edwin Pan, Yannis Y . He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R. Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? October 2025. URL...

work page 2025

[9] [9]

Gemini-3-1-flash-lite-model-card, March 2026

Google. Gemini-3-1-flash-lite-model-card, March 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Flash-Lite-Model-Card.pdf

work page 2026

[10] [10]

Critiq: Mining data quality criteria from human preferences

Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, and Tao Gui. Critiq: Mining data quality criteria from human preferences. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pile- hvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational...

work page doi:10.18653/v1/2025.acl-long.792 2025

[11] [11]

Terminus-2, 2026

Harbor. Terminus-2, 2026. URL https://www.harborframework.com/docs/agents/ terminus-2

work page 2026

[12] [12]

Automated design of agentic systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. InThe Thirteenth International Conference on Learning Representations, October 2024. URL https: //openreview.net/forum?id=t9U3LW7JVX

work page 2024

[13] [13]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id= chfJJYC3iL

work page 2024

[14] [14]

R2e-gym: Procedural environment generation and hybrid verifiers for scaling open-weights swe agents

Naman Jain, Jaskirat Singh, Manish Shetty, Tianjun Zhang, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environment generation and hybrid verifiers for scaling open-weights swe agents. InSecond Conference on Language Modeling, August 2025. URL https://openreview.net/forum?id=7evvwwdo3z#discussion

work page 2025

[15] [15]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=VTF8yNQM66

work page 2023

[16] [16]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, October 2023. URLhttp://arxiv.org/abs/2310.03714

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Meta-Harness: End-to-End Optimization of Model Harnesses

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, March 2026. URL http: //arxiv.org/abs/2603.28052

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Agent debugger: Understanding agent trajectory with agentic workflows - dawning road, February 2026

Lizhi Lin. Agent debugger: Understanding agent trajectory with agentic workflows - dawning road, February 2026. URLhttps://dawning-road.github.io/blog/agent-debugger

work page 2026

[19] [19]

Harness engineering: Leveraging codex in an agent-first world, February 2026

Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world, February 2026. URLhttps://openai.com/zh-Hans-CN/index/harness-engineering/

work page 2026

[20] [20]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, April 2026. URLhttp://arxiv.org/abs/2604.08377

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Self- refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self- refine: Iterative refinement with self-feedback. InThirty-Seventh Conference on Neural Infor- m...

work page 2023

[22] [22]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Swe- lancer: Can frontier llms earn $1 million from real-world freelance software engineer- ing? InF orty-Second International Conference on Machine Learning, June 2025

Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe- lancer: Can frontier llms earn $1 million from real-world freelance software engineer- ing? InF orty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=xZXhFg43EI

work page 2025

[24] [24]

Nexau (au for agent universe), a general-purpose agent framework for building intelligent agents with tool capabilities., 2025

Nex-AGI. Nexau (au for agent universe), a general-purpose agent framework for building intelligent agents with tool capabilities., 2025. URL https://github.com/nex-agi/NexAU

work page 2025

[25] [25]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Codex cli, 2025

OpenAI. Codex cli, 2025. URLhttps://developers.openai.com/codex/cli

work page 2025

[27] [27]

Introducing gpt-5.4, March 2026

OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

work page 2026

[28] [28]

Optimizing instructions and demonstrations for multi-stage language model programs

Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9...

work page doi:10.18653/v1/2024.emnlp-main.525 2024

[29] [29]

Training software engineering agents and verifiers with swe-gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. InF orty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/ forum?id=Cq1BNvHx74

work page 2025

[30] [30]

Harness design for long-running application develop- ment, March 2026

Prithvi Rajasekaran. Harness design for long-running application develop- ment, March 2026. URL https://www.anthropic.com/engineering/ harness-design-long-running-apps

work page 2026

[31] [31]

Effective context engineering for ai agents, September 2025

Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield, Rafi Ayub, Hannah Moran, Cal Rueb, Connor Jennings, Molly V orwerck, Stuart Ritchie, and Maggie V o. Effective context engineering for ai agents, September 2025. URL https://www.anthropic.com/ engineering/effective-context-engineering-for-ai-agents

work page 2025

[32] [32]

Hermes agent — the agent that grows with you, 2026

Nous Research. Hermes agent — the agent that grows with you, 2026. URL https:// hermes-agent.nousresearch.com/. 12

work page 2026

[33] [33]

Narasimhan, and Shunyu Yao

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InThirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview. net/forum?id=vAElhFcKW6

work page 2023

[34] [34]

Openclaw — personal ai assistant, February 2026

Peter Steinberger. Openclaw — personal ai assistant, February 2026. URL https://openclaw. ai/

work page 2026

[35] [35]

The bitter lesson, March 2019

Rich Sutton. The bitter lesson, March 2019. URL https://www.cs.utexas.edu/~eunsol/ courses/data/bitter_lesson.pdf

work page 2019

[36] [36]

Kimi k2.6 tech blog: Advancing open-source coding, April 2026

Kimi Team. Kimi k2.6 tech blog: Advancing open-source coding, April 2026. URL https: //www.kimi.com/blog/kimi-k2-6

work page 2026

[37] [37]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y . Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] Nex-AGI Team, Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun ... Nex-n1: Agentic models trained via a unified ecosystem for large-scale environment construction. arXiv preprint arXiv:2512.04987, 2025

[39] Qwen Team. Qwen3.6-plus: Towards real world agents, April 2026. URL https://qwenlm.github.io/blog/qwen3.6/

[40] Xiaomi MiMo Team. Mimo-v2.5-pro, April 2026. URL https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro

[41] Vivek Trivedy. Improving deep agents with harness engineering, February 2026. URL https://www.langchain.com/blog/improving-deep-agents-with-harness-engineering

[42] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, October 2023. URL http://arxiv.org/abs/2305.16291

[43] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI software developers as generalist agents...

[44] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning, February 2026. URL http://arxiv.org/abs/2602.08234

[45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report

[46] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, November 2024. URL https://openreview.net/forum?id=mXpq6ut8J3&referrer=%5Bthe%20profi...

[47] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, October 2024. URL ...

[48] Yucheng Zeng, Shupeng Li, Daxiang Dong, Ruijie Xu, Zimo Chen, Liwei Zheng, Yuxuan Li, Zhe Zhou, Haotian Zhao, Lun Tian, Heng Xiao, Tianshu Zhu, Longkun Hao, and Jianmin Wu. Swe-hub: A unified production system for scalable, executable software engineering tasks, February 2026. URL http://arxiv.org/abs/2603.00575

[49] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=z5uVAKwmjf

[50] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations, October 2025. URL...

[51] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, December 2024. URL http://arxiv.org/abs/2308.10144

[52] Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, and Yuchen Eleanor Jiang. Symbolic learning enables self-evolving agents, June 2024. URL http://arxiv.org/abs/2406.18532

[53] Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Biny... Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions

[54] Gregor Zunic. The bitter lesson of agent harnesses, April 2026. URL https://browser-use.com/posts/bitter-lesson-agent-harnesses.

A Experimental Setup: Full Details

This appendix expands the condensed Setup in §4.1 with the formal metric definitions and the runtime infrastructure.

Seed agent. The seed configuration, denoted NexAU_0, is a simple code agent b...

- **Failure evidence** -- which tasks failed, and what specifically went wrong (from analysis reports or traces)
- **Root cause** -- why it failed, not just what failed
- **Targeted fix** -- a change that directly addresses the root cause
- **Predicted impact** -- which tasks this should fix, and which tasks might be at risk

# Environment

{% if ws != "workspace" %}
> **WORKSPACE PATH**: Your workspace is at `{{ ws }}/` instead of `workspace/`. All `workspace/` references below apply to `{{ ws }}/`. Use `{{ ws }}/` in file operations, git commands, and the validation command.
{% endif %}

> **Loop co...

- Read `evolution_history.md` -- understand what's been tried, what worked, what failed
- **Read `runs/iteration_NNN/input/analysis/overview.md` FIRST** -- this is your primary information source
- **Read `runs/iteration_NNN/input/analysis/detail/{task_name}.md`** for tasks needing deeper investigation
- Only fall back to reading raw `nexau_in_memory_tracer.cleaned.json` when analysis is missing or insufficient -- this should be rare
- **After creating or modifying middleware**, read at least one `agent/nexau.txt` from a failed task -- it contains runtime logs (middleware init errors, warnings, crashes) that static validation cannot catch

- Group failures into **pattern classes** -- each pattern = a class of failures, not individual tasks
- For each pattern, identify the **root cause** and choose the most appropriate fix -- could be prompt, tool, middleware, or any component
- **Architecture check** -- for each failure pattern, consider whether the fix belongs at a different component level. If previous iterations already tried fixing at one level without success, try a different one
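The reading order described above (overview first, per-task detail next, raw trace only as a fallback) can be sketched as a small helper. This is an illustrative Python sketch, not part of the paper's harness: the function name `evidence_reading_order` is hypothetical, and the location of the raw trace file inside a per-task folder is an assumption, since the prompt names the file but not its directory.

```python
from __future__ import annotations

from pathlib import Path

# File name taken from the prompt; its directory below is an assumption.
RAW_TRACE_NAME = "nexau_in_memory_tracer.cleaned.json"


def evidence_reading_order(iteration_dir: Path, task_names: list[str]) -> list[Path]:
    """List evidence files in the order the prompt asks for.

    The overview comes first, then one detail report per task. The raw
    trace is added only when a task has no detail report.
    """
    analysis_dir = iteration_dir / "input" / "analysis"
    order: list[Path] = []

    overview = analysis_dir / "overview.md"
    if overview.exists():
        order.append(overview)

    for task in task_names:
        detail = analysis_dir / "detail" / f"{task}.md"
        if detail.exists():
            order.append(detail)
            continue
        # Assumption: the raw trace sits in a per-task folder under input/.
        raw_trace = iteration_dir / "input" / task / RAW_TRACE_NAME
        if raw_trace.exists():
            order.append(raw_trace)

    return order
```

Tasks with neither a detail report nor a raw trace are simply skipped, which keeps the fallback rare, as the prompt intends.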

chg-N: <short description>

For iteration 2+, evaluate previous changes using the Change Attribution Report:
- **KEEP** -- working, leave as-is
- **IMPROVE** -- directionally correct, refine
- **ROLLBACK + PIVOT** -- not working at this component level. Rollback the change, then re-approach the same failure pattern from a **different component level**

**The sole optimization target...
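The KEEP / IMPROVE / ROLLBACK + PIVOT triage above can be pictured as a comparison between what a change predicted and what the next evaluation observed. The following is a minimal Python sketch under stated assumptions: the `ChangeRecord` fields, the `triage` function, and the decision thresholds are hypothetical illustrations, not the paper's Change Attribution Report format or its actual decision rule.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical record of one harness edit and its predicted impact."""
    change_id: str                 # e.g. "chg-1"
    component: str                 # e.g. "prompt", "tool", "middleware"
    predicted_fixes: set[str] = field(default_factory=set)
    # Recorded for later inspection; not used by triage() below.
    at_risk: set[str] = field(default_factory=set)


def triage(change: ChangeRecord,
           before: dict[str, bool],
           after: dict[str, bool]) -> str:
    """Return KEEP, IMPROVE, or ROLLBACK + PIVOT for one change.

    `before` and `after` map task name -> passed for the iterations
    before and after the change. The thresholds are assumptions.
    """
    # Predicted tasks that went from failing to passing.
    fixed = {t for t in change.predicted_fixes
             if not before.get(t, False) and after.get(t, False)}
    # Any task that went from passing to failing.
    regressed = {t for t in after
                 if before.get(t, False) and not after[t]}

    if fixed == change.predicted_fixes and not regressed:
        return "KEEP"
    if len(fixed) > len(regressed):
        return "IMPROVE"
    return "ROLLBACK + PIVOT"
```

Because the verdict is computed from an explicit prediction, a change whose predicted fixes never materialize is flagged, which is the sense in which a prediction makes an edit checkable.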

- **How to write middleware** -- base class, hook methods, params, registration, real examples from source
- **How to create tools** -- YAML schema, Python function signature, binding, agent_state injection
- **How to create skills** -- SKILL.md format, frontmatter, registration, loading mechanism
- **How to create sub-agents** -- config schema, registration, invocation, context isolation
- **YAML config schema** -- complete field reference with types, defaults, required/optional
- **Key runtime behaviors** -- only what's needed to write correct components

# Source Code Location (READ ONLY)
- NexAU framework: `{{ nexau_path }}`

# Output Directory (WRITE)
- Skill file: `{{ output_skill_dir }}/nexau-framework-internals/SKILL.md`

# [!] MANDATORY WORKFLOW: Explore-Write-Refine Cycles
You MUST follow this phased workflow. Do NOT spend all your time reading ...

- Read key files: config dataclasses, hooks.py base class, existing middleware/tool implementations
- **WRITE the initial SKILL.md** with whatever you have -- even if incomplete, use "[TODO]" placeholders

## Phase 2: Practical Patterns (iterations 16-60)

- For each section below, find **real code examples** from the source
- **After each section, immediately `write_file` to UPDATE SKILL.md**
- Priority order: section 1 Config -> section 2 Middleware -> section 3 Tools -> section 4 Skills -> section 5 Sub-Agents -> section 6 Runtime

## Phase 3: Polish & Complete (iterations 61-80)

- Fill remaining "[TODO]" sections, add copy-paste templates
- Call `complete_task`

**HARD RULES:**
- You MUST call `write_file` for SKILL.md **before iteration 20**. No exceptions.
- You MUST call `write_file` to update SKILL.md **at least every 15 iterations** after that.
- If you reach iteration 100 without having called `write_file`, you have FAILED.
- Use `read_file` with offset/limit for large files.
- Cite `file:line_r...
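The write-cadence rules above (a first `write_file` before iteration 20, then at least one every 15 iterations) are mechanical enough to check from a log of iteration numbers. The sketch below is an illustration only: the function name `check_write_cadence` and the reading of "at least every 15 iterations" as "no gap larger than 15" are assumptions, not part of the paper's harness.

```python
from __future__ import annotations


def check_write_cadence(write_iterations: list[int],
                        current_iteration: int,
                        first_deadline: int = 20,
                        max_gap: int = 15) -> list[str]:
    """Return human-readable rule violations for a run so far.

    `write_iterations` holds the iteration numbers at which `write_file`
    was called for SKILL.md. An empty result means no rule is broken.
    """
    violations: list[str] = []
    writes = sorted(write_iterations)

    if not writes:
        if current_iteration >= first_deadline:
            violations.append(
                f"no write_file call before iteration {first_deadline}")
        return violations

    if writes[0] >= first_deadline:
        violations.append(
            f"first write_file at iteration {writes[0]}, "
            f"not before {first_deadline}")

    # Check gaps between consecutive writes, and from the last write to now.
    checkpoints = writes + [current_iteration]
    for previous, following in zip(checkpoints, checkpoints[1:]):
        if following - previous > max_gap:
            violations.append(
                f"gap of {following - previous} iterations after "
                f"iteration {previous} exceeds {max_gap}")

    return violations
```

For example, writes at iterations 12, 25, and 40 with the run currently at iteration 50 produce no violations under these assumptions, while a single write at iteration 10 checked at iteration 40 reports one oversized gap.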