Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Chengjun Pan; Hang Yan; Jiahang Lin; Lizhi Lin; Shichun Liu; Shihan Dou; Tao Gui; Xuanjing Huang; Yu-Gang Jiang; Zhenhua Han

arXiv: 2604.25850 · v4 · pith:6ONVFNPInew · submitted 2026-04-28 · cs.CL · cs.SE


Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang


Pith reviewed 2026-05-07 16:12 UTC · model grok-4.3


keywords: Agentic Harness Engineering · coding agents · observability · automatic evolution · harness engineering · Terminal-Bench · SWE-bench · agent performance


The pith

Three observability pillars let coding-agent harnesses evolve autonomously to beat human designs and transfer across benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentic Harness Engineering as a closed-loop process that automates improvements to the harnesses mediating how coding models use tools and environments. It addresses the problems of complex edit spaces, buried trajectory signals, and hard-to-attribute changes by creating three matched observability structures that make components explicit, distill experiences into usable layers, and link each edit to a verifiable prediction. A reader would care if this turns harness design from repeated manual tuning into a repeatable, falsifiable process that produces stronger and more reusable agent setups. The reported results include a rise in pass@1 from 69.7% to 77.0% on Terminal-Bench 2, outpacing both a human-designed baseline and prior self-evolving methods, plus successful transfer to another benchmark at lower token cost.

Core claim

Agentic Harness Engineering turns harness evolution into an autonomous loop by giving every editable component a file-level representation, distilling raw trajectories into a drill-down evidence corpus, and pairing each edit with a self-declared prediction that is checked against later task outcomes. Ten iterations of this loop raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, exceeding the human-designed Codex-CLI harness at 71.9% and the self-evolving baselines. The resulting frozen harness transfers to SWE-bench-verified with 12% fewer tokens than the seed and delivers +5.1 to +10.1 percentage-point gains across three alternate model families on Terminal-Bench 2, while ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt.

What carries the argument

The three matched observability pillars that render harness components as explicit file-level objects, compress trajectories into layered evidence, and enforce prediction-then-verification contracts on every edit.
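The prediction-then-verification contract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `ChangeEntry` fields, the `verify` function, and the task names are all hypothetical, and the outcomes are made-up data rather than reported results.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeEntry:
    """Hypothetical manifest entry: one harness edit plus its declared prediction."""
    change_id: str
    component: str  # e.g. "tools", "middleware", "system_prompt"
    predicted_fixes: set[str] = field(default_factory=set)        # expected fail -> pass
    predicted_regressions: set[str] = field(default_factory=set)  # expected pass -> fail


def verify(entry: ChangeEntry, before: dict[str, bool], after: dict[str, bool]) -> dict[str, set[str]]:
    """Check the declared prediction against task-level outcomes of the next round.

    A task missing from `after` is treated as failed (an assumption of this sketch).
    """
    fixed = {t for t in before if not before[t] and after.get(t, False)}
    regressed = {t for t in before if before[t] and not after.get(t, False)}
    return {
        "confirmed_fixes": entry.predicted_fixes & fixed,
        "missed_fixes": entry.predicted_fixes - fixed,
        "unpredicted_regressions": regressed - entry.predicted_regressions,
    }


# Illustrative outcomes only (True = task passed).
before = {"task-a": False, "task-b": False, "task-c": True}
after = {"task-a": True, "task-b": False, "task-c": True}
entry = ChangeEntry("chg-1", "tools", predicted_fixes={"task-a", "task-b"})
report = verify(entry, before, after)
print(report)
```

Here the edit predicted two fixes and one materialized, so the contract is only partly honored; an evolution loop could use a report like this to decide whether to keep or revert the edit.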

If this is right

The evolved harness can be frozen and reused on new tasks without further evolution while still showing gains.
Performance improvements localize to tools, middleware, and long-term memory components rather than the system prompt.
Cross-model gains appear on three alternate families, indicating the changes capture reusable engineering patterns.
Token usage drops on transferred tasks, showing efficiency as a side benefit of the evolved structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distinction between transferable structural edits and non-transferable prompt edits suggests future evolution loops should prioritize component and memory changes over prose strategy.
The same observability approach could be tested on agent harnesses for non-coding domains such as data analysis or web navigation.
If the pillars scale to larger action spaces, they might reduce the need for human oversight in other automated agent-improvement pipelines.

Load-bearing premise

The three observability pillars give enough structure and signal that the evolution loop produces general improvements rather than noise-driven or benchmark-specific changes.

What would settle it

Apply the final evolved harness to a fresh coding benchmark family outside Terminal-Bench and SWE-bench; if pass rates show no lift over the seed harness, the claim of generalizable evolution would fail.

Figures

Figures reproduced from arXiv: 2604.25850 by Chengjun Pan, Hang Yan, Jiahang Lin, Lizhi Lin, Shichun Liu, Shihan Dou, Tao Gui, Xuanjing Huang, Yu-Gang Jiang, Zhenhua Han, Zhiheng Xi.

**Figure 1.** AHE evolves a bash-only seed past every human-designed and self-evolving baseline on Terminal-Bench 2. All three role agents share one base model, isolating the gain to harness edits rather than analyzer or editor capability. Harness design materially shifts task completion on long-horizon coding benchmarks, even with the base model held fixed [40, 42], making harness engineering a first-class lever for im…

**Figure 2.** The AHE pipeline links three observable surfaces into one closed loop. Components, rollout experience, and edit decisions each surface as structured artifacts another agent reads, and every edit becomes a falsifiable prediction the next round verifies. Three observability layers implement this principle. Component observability (§3.1) is realized by a decoupled, file-level harness substrate that maps each …

**Figure 3.** Cross-model transfer on Terminal-Bench 2, 89 tasks. The AHE workspace evolved on

**Figure 4.** Cross-iteration mean precision and recall of the evolve model’s self-predictions across 9

**Figure 5.** Three-column trajectory comparison for db-wal-recovery before and after chg-1. Both rollouts share the same random seed and the same first three steps S1 to S3, summarized in the banner above the columns. The left column lists the four divergence steps F1 to F4 of the failing rollout. The middle column lists the four chg-1 rules out of eight that fire on this trajectory, each annotated with the failure ste…

**Figure 6.** Three-column trajectory comparison for mcmc-sampling-stan before and after the two harness changes shipped at the start of iteration 6: the tool-level publish-state guard chg-1 at commit ff0cf3d and the middleware-level execution-risk hints chg-2 at commit 9651986, whose full manifest entry appears in

**Figure 7.** Two change-manifest entries written in iteration 1, one editing the system prompt and one

**Figure 8.** The two change-manifest entries written together at the iteration-4 boundary and shipped as

**Figure 9.** The two change-manifest entries shipped as the iteration-6 harness.

**Figure 10.** Two change-manifest entries written together at the iteration-7 boundary and shipped

**Figure 11.** Per-round fix predictions. Left: precision. Right: recall. Bars decompose each denominator

**Figure 12.** Per-round regression predictions. Left: precision. Right: recall. Same encoding as Fig.
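The precision and recall plotted in Figures 4, 11, and 12 can be read with the standard set-based definitions. The sketch below assumes those definitions; the truncated captions do not show the paper's exact denominators, and the function name and task identifiers are hypothetical.

```python
from __future__ import annotations


def precision_recall(predicted: set[str], observed: set[str]) -> tuple[float | None, float | None]:
    """Standard set-based precision and recall for one round of self-predictions.

    predicted: tasks the evolve model said would flip (fixes or regressions).
    observed:  tasks that actually flipped in the next round.
    Returns None for a metric whose denominator is empty.
    """
    hits = len(predicted & observed)
    precision = hits / len(predicted) if predicted else None
    recall = hits / len(observed) if observed else None
    return precision, recall


# Illustrative data only: four predicted fixes, three observed, two in common.
predicted_fixes = {"task-a", "task-b", "task-c", "task-d"}
observed_fixes = {"task-a", "task-b", "task-e"}
p, r = precision_recall(predicted_fixes, observed_fixes)
print(f"precision={p:.2f} recall={r:.2f}")
```

Low precision would mean the model claims fixes that do not materialize; low recall would mean outcomes change for reasons the model did not anticipate.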

Original abstract

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
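As a rough illustration of what a "layered, drill-down evidence corpus" might look like, the sketch below builds three layers from raw rollouts: an overview, a per-task record, and the raw steps. The layer names, the `Rollout` fields, and the failure tags are assumptions made for this example; the abstract does not specify the paper's actual layout.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Rollout:
    task: str
    passed: bool
    failure_tag: str | None  # hypothetical short label for the failure mode
    steps: list[str]         # raw trajectory steps


def build_corpus(rollouts: list[Rollout]) -> dict:
    """Distill rollouts into three layers: overview -> per-task -> raw steps."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    overview = {
        "pass_rate": sum(r.passed for r in rollouts) / len(rollouts),
        "failure_tags": dict(Counter(r.failure_tag for r in rollouts if not r.passed)),
    }
    per_task = {
        r.task: {"passed": r.passed, "failure_tag": r.failure_tag, "n_steps": len(r.steps)}
        for r in rollouts
    }
    raw = {r.task: r.steps for r in rollouts}
    return {"overview": overview, "per_task": per_task, "raw": raw}


def drill_down(corpus: dict, tag: str, last_n: int = 2) -> dict[str, list[str]]:
    """Follow one failure tag from the per-task layer down to the final raw steps."""
    return {
        task: corpus["raw"][task][-last_n:]
        for task, rec in corpus["per_task"].items()
        if rec["failure_tag"] == tag
    }


# Illustrative rollouts only.
rollouts = [
    Rollout("task-a", True, None, ["ls", "make", "pytest"]),
    Rollout("task-b", False, "timeout", ["ls", "sleep 600", "killed"]),
    Rollout("task-c", False, "timeout", ["cat x", "run", "hang", "killed"]),
    Rollout("task-d", False, "wrong-output", ["edit", "run"]),
]
corpus = build_corpus(rollouts)
evidence = drill_down(corpus, "timeout")
print(corpus["overview"], evidence)
```

The point of the layering is that an agent can start from the small overview and open raw steps only for the failure modes it wants to act on, instead of reading every trajectory token.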

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable structure for automating harness edits via three observability pillars and shows benchmark gains, but the transfer claims rest on single-trajectory results without variance checks.


The main takeaway is that this work turns harness evolution into a more controlled process by requiring the agent to declare predictions for each change and then verify them against outcomes. That decision observability, paired with explicit component representations and distilled trajectory evidence, is the clearest new piece. It moves beyond generic self-evolution loops by making edits falsifiable contracts rather than open-ended trial and error. The ablations that tie gains to tools, middleware, and memory instead of the system prompt are also useful; they give a concrete signal about what actually transfers when the harness is frozen and reused on new models or benchmarks.

The reported lift from 69.7% to 77.0% on Terminal-Bench 2 and the cross-family gains of +5.1 to +10.1 points are the empirical hook, and the token savings on SWE-bench-verified add a practical angle. Those numbers suggest the method can produce reusable structure in at least some cases.

The soft spots are mostly around statistical grounding. No error bars, run counts, or significance tests appear in the reported results, so it is hard to know whether the 7.3-point improvement sits outside normal variation of the seed harness. The SWE-bench transfer is described only as “tops aggregate success at 12% fewer tokens” without the raw success-rate delta, which leaves the generalization claim partly interpretive. A control that applies the same number of edits without the layered evidence or self-prediction step would have made the pillars’ causal role clearer.

Readers working on coding-agent deployment will get the most from this; the setup is concrete enough to try on their own harnesses. It deserves a serious referee because the problem is real and the framing is actionable, even if the current experiments need tighter controls and more runs to support the broader claims. I would send it for review with a request for variance data and exact transfer metrics.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agentic Harness Engineering (AHE), a closed-loop system for automatically evolving coding-agent harnesses via three observability pillars—component observability (explicit file-level editable components), experience observability (distilled layered evidence from trajectories), and decision observability (self-predicted edits verified against outcomes). It claims that 10 AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0% (surpassing Codex-CLI at 71.9% and baselines ACE/TF-GRPO), with the frozen evolved harness transferring to SWE-bench-verified (higher aggregate success at 12% fewer tokens) and yielding +5.1 to +10.1pp gains across three alternate model families on Terminal-Bench 2; ablations attribute gains to tools/middleware/memory rather than prompts.

Significance. If the central claims hold under rigorous controls, the work would be a meaningful advance in automating harness design for coding agents, a currently manual process. The transfer results without re-evolution and the localization of gains to structural components (rather than prose) suggest the method can produce reusable engineering knowledge. The decision-observability mechanism for turning edits into falsifiable contracts is a conceptual strength that could generalize beyond the reported benchmarks.

major comments (3)

[Empirical evaluation] Results reporting the 7.3pp lift: the +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.
[Ablation studies] Gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.
[Transfer experiments] The SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens” with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.
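A back-of-envelope calculation shows why the missing variance data matters. The sketch below is not the authors' analysis. It assumes the 89-task count from the Figure 3 caption also applies to the headline result, that each task is an independent pass/fail trial, and that the score comes from a single pass over the benchmark. If pass@1 is averaged over several attempts per task, the true spread would be narrower.

```python
import math


def proportion_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Normal-approximation standard error and ~95% interval for a pass rate p over n tasks."""
    se = math.sqrt(p * (1.0 - p) / n)
    return se, p - z * se, p + z * se


N_TASKS = 89  # from the Figure 3 caption; assumed here to apply to the main result

seed_se, seed_lo, seed_hi = proportion_interval(0.697, N_TASKS)
evolved_se, evolved_lo, evolved_hi = proportion_interval(0.770, N_TASKS)

print(f"seed:    69.7% +/- {1.96 * seed_se * 100:.1f} pp")
print(f"evolved: 77.0% +/- {1.96 * evolved_se * 100:.1f} pp")
print("intervals overlap:", evolved_lo < seed_hi)
```

Under these assumptions each interval is roughly nine to ten percentage points wide on either side, so the two ranges overlap. A paired per-task comparison or repeated runs could still show a real effect, which is what the requested variance data would establish.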

minor comments (2)

[Abstract and Methods] The abstract and methods could more explicitly define how an “iteration” is counted and what constitutes a single edit within the closed loop.
[Introduction and Methodology] Notation for the three pillars would benefit from consistent acronym usage or a summary table to improve readability when referring back to them in later sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical evaluation, ablation design, and transfer results. These comments highlight areas where we can improve rigor and clarity. We respond point by point below and commit to revisions.


Referee: Empirical evaluation (results reporting the 7.3pp lift): the +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.

Authors: We acknowledge that the primary results are reported from a single evolution run without error bars or significance tests. Each full AHE iteration incurs substantial compute for trajectory collection and evaluation across the benchmark, which constrained the initial experiments to one run. We will add error bars derived from repeated evaluations of the final harness, report results from one additional independent evolution run, and include a basic statistical comparison in the revised manuscript to address run-to-run variance.
revision: yes

Referee: Ablation studies: gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.

Authors: We agree that the current ablations, which remove individual pillars, do not fully isolate the contribution of the observability mechanisms from the mere act of performing edits. A control applying an equivalent number of edits without component, experience, and decision observability would strengthen the causal argument. We will add this baseline in the revision, comparing AHE-guided evolution against random or heuristic edits of matching volume, to demonstrate that the pillars are necessary for the observed gains.
revision: yes

Referee: Transfer experiments: the SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens” with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.

Authors: We will expand the transfer section to report the exact aggregate success rates on SWE-bench-verified for the seed and evolved harness, include variance or confidence intervals, and add a per-task breakdown table. This will quantify the improvement more precisely and better support the interpretation that the evolved components capture reusable engineering knowledge rather than benchmark-specific tuning.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in the AHE derivation


The paper describes an empirical closed-loop evolution process driven by three observability pillars that convert edits into verifiable predictions against task outcomes. Performance claims rest on reported benchmark lifts (Terminal-Bench 2, SWE-bench-verified) and cross-model transfer rather than any equation or definition that reduces to its own inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or rename known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The approach rests on the assumption that harness components admit clean file-level representations and that distilled trajectory summaries contain sufficient signal for autonomous decision-making; no explicit free parameters or invented physical entities are stated.

axioms (2)

domain assumption: Harness components can be represented at file level in a way that makes the action space explicit and revertible.
Invoked in the description of the component observability pillar.
domain assumption: Millions of raw trajectory tokens can be distilled into a layered evidence corpus that an evolving agent can consume effectively.
Core premise of experience observability.
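The first assumption, that file-level components make the action space explicit and revertible, can be illustrated with a small workspace wrapper. Everything here is hypothetical, including the `HarnessWorkspace` class and the file path. The Figure 6 caption refers to commits, which suggests a version-controlled substrate; this sketch swaps in a simple in-memory undo stack instead.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class HarnessWorkspace:
    """Hypothetical workspace in which each editable component is one file under `root`."""

    def __init__(self, root: Path) -> None:
        self.root = root
        # Undo stack of (relative path, previous text or None if the file was new).
        self._history: list[tuple[str, str | None]] = []

    def components(self) -> list[str]:
        """The explicit action space: every file that an edit may target."""
        return sorted(p.relative_to(self.root).as_posix() for p in self.root.rglob("*") if p.is_file())

    def edit(self, rel_path: str, new_text: str) -> None:
        """Write a component, remembering its previous content for later revert."""
        path = self.root / rel_path
        previous = path.read_text() if path.exists() else None
        self._history.append((rel_path, previous))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(new_text)

    def revert_last(self) -> None:
        """Undo the most recent edit, deleting the file if the edit created it."""
        rel_path, previous = self._history.pop()
        path = self.root / rel_path
        if previous is None:
            path.unlink()
        else:
            path.write_text(previous)


with tempfile.TemporaryDirectory() as tmp:
    ws = HarnessWorkspace(Path(tmp))
    ws.edit("tools/bash.md", "v1")  # hypothetical component file
    ws.edit("tools/bash.md", "v2")
    ws.revert_last()                # roll back the second edit
    restored = (Path(tmp) / "tools/bash.md").read_text()
    listing = ws.components()

print(restored, listing)
```

Because every edit targets a named file and stores its previous content, a failed prediction can be undone without touching unrelated components.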

invented entities (3)

Component observability pillar (no independent evidence)
purpose: Gives every editable harness component a file-level representation so that actions are explicit and revertible.
New conceptual construct introduced to address the heterogeneous action space.
Experience observability pillar (no independent evidence)
purpose: Distills voluminous trajectories into a drill-down evidence corpus.
New conceptual construct to handle signal burial in trajectories.
Decision observability pillar (no independent evidence)
purpose: Pairs every edit with a self-declared prediction verified against outcomes.
New conceptual construct to turn edits into falsifiable contracts.

pith-pipeline@v0.9.0 · 5644 in / 1729 out tokens · 71405 ms · 2026-05-07T16:12:58.518385+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
stat.CO · 2026-05 · unverdicted · novelty 6.0
AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.

Code as Agent Harness
cs.CL · 2026-05 · accept · novelty 5.0
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
cs.CL · 2026-05 · unverdicted · novelty 5.0
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.