Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
Pith reviewed 2026-05-07 16:12 UTC · model grok-4.3
The pith
Three observability pillars let coding-agent harnesses evolve autonomously to beat human designs and transfer across benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic Harness Engineering turns harness evolution into an autonomous loop by giving every editable component a file-level representation, distilling raw trajectories into a drill-down evidence corpus, and pairing each edit with a self-declared prediction that is checked against later task outcomes. Ten iterations of this loop raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, exceeding the human-designed Codex-CLI harness at 71.9% and the self-evolving baselines. The resulting frozen harness transfers to SWE-bench-verified with 12% fewer tokens than the seed and delivers +5.1 to +10.1 percentage-point gains across three alternate model families on Terminal-Bench 2, while ablations show that the gains come from tools, middleware, and long-term memory rather than the system prompt.
What carries the argument
The three matched observability pillars that render harness components as explicit file-level objects, compress trajectories into layered evidence, and enforce prediction-then-verification contracts on every edit.
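The prediction-then-verification contract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the `EditContract` schema, the task IDs, the component path, and the keep rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EditContract:
    """One harness edit plus the prediction declared with it (hypothetical schema)."""
    edit_id: str
    component: str                 # illustrative file path of the edited component
    predicted_fixes: set[str]      # task IDs expected to flip from fail to pass
    predicted_regressions: set[str] = field(default_factory=set)


def verify(contract: EditContract,
           before: dict[str, bool],
           after: dict[str, bool]) -> dict[str, object]:
    """Check the declared prediction against the next round's task-level outcomes."""
    # Tasks that failed before the edit and pass after it.
    fixed = {t for t in before if not before[t] and after.get(t, False)}
    # Tasks that passed before the edit and fail (or are missing) after it.
    regressed = {t for t in before if before[t] and not after.get(t, False)}
    confirmed = contract.predicted_fixes & fixed
    unforeseen = regressed - contract.predicted_regressions
    return {
        "confirmed_fixes": confirmed,
        "missed_fixes": contract.predicted_fixes - fixed,
        "unforeseen_regressions": unforeseen,
        # Illustrative keep rule (an assumption): at least one predicted fix
        # landed and no task broke that the edit did not foresee.
        "keep": bool(confirmed) and not unforeseen,
    }


contract = EditContract("edit-001", "tools/shell.py", {"task-a", "task-b"})
before = {"task-a": False, "task-b": False, "task-c": True}
after = {"task-a": True, "task-b": False, "task-c": True}
report = verify(contract, before, after)
```

In this toy round the edit predicted two fixes and one landed, so the illustrative rule keeps it. A stricter rule could require every predicted fix to land.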
If this is right
- The evolved harness can be frozen and reused on new tasks without further evolution while still showing gains.
- Performance improvements localize to tools, middleware, and long-term memory components rather than the system prompt.
- Cross-model gains appear on three alternate families, indicating the changes capture reusable engineering patterns.
- Token usage drops on transferred tasks, showing efficiency as a side benefit of the evolved structure.
Where Pith is reading between the lines
- The distinction between transferable structural edits and non-transferable prompt edits suggests future evolution loops should prioritize component and memory changes over prose strategy.
- The same observability approach could be tested on agent harnesses for non-coding domains such as data analysis or web navigation.
- If the pillars scale to larger action spaces, they might reduce the need for human oversight in other automated agent-improvement pipelines.
Load-bearing premise
The three observability pillars give enough structure and signal that the evolution loop produces general improvements rather than noise-driven or benchmark-specific changes.
What would settle it
Apply the final evolved harness to a fresh coding benchmark family outside Terminal-Bench and SWE-bench; if pass rates show no lift over the seed harness, the claim of generalizable evolution would fail.
Original abstract
Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
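Component observability, as the abstract describes it, makes the action space "explicit and revertible" by treating each component as a file. The sketch below shows one way that property could be modeled in memory. The `HarnessWorkspace` class, the file names, and the snapshot strategy are all hypothetical; the abstract does not say how AHE stores or reverts components, and a real system might rely on a version-control tool instead.

```python
import hashlib


class HarnessWorkspace:
    """In-memory stand-in for a harness whose components are files (hypothetical)."""

    def __init__(self, files: dict[str, str]):
        self.files = dict(files)
        # Each entry is (edit_id, snapshot of all files taken before the edit).
        self.history: list[tuple[str, dict[str, str]]] = []

    def fingerprint(self) -> str:
        """Stable hash over every component path and its content."""
        h = hashlib.sha256()
        for path in sorted(self.files):
            h.update(path.encode())
            h.update(b"\0")
            h.update(self.files[path].encode())
            h.update(b"\0")
        return h.hexdigest()

    def apply_edit(self, edit_id: str, path: str, new_content: str) -> None:
        """Snapshot the current state, then overwrite one component file."""
        self.history.append((edit_id, dict(self.files)))
        self.files[path] = new_content

    def revert(self, edit_id: str) -> None:
        """Restore the state from just before `edit_id`, discarding later edits too."""
        for i, (eid, snapshot) in enumerate(self.history):
            if eid == edit_id:
                self.files = dict(snapshot)
                del self.history[i:]
                return
        raise KeyError(edit_id)


ws = HarnessWorkspace({"system_prompt.md": "v1", "tools/shell.py": "v1"})
seed_fingerprint = ws.fingerprint()
ws.apply_edit("edit-001", "tools/shell.py", "v2")
ws.revert("edit-001")
restored = ws.fingerprint() == seed_fingerprint
```

Because every change is a named edit against a named file, a failed prediction can be undone by edit ID without touching unrelated components.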
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Harness Engineering (AHE), a closed-loop system for automatically evolving coding-agent harnesses via three observability pillars—component observability (explicit file-level editable components), experience observability (distilled layered evidence from trajectories), and decision observability (self-predicted edits verified against outcomes). It claims that 10 AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0% (surpassing Codex-CLI at 71.9% and baselines ACE/TF-GRPO), with the frozen evolved harness transferring to SWE-bench-verified (higher aggregate success at 12% fewer tokens) and yielding +5.1 to +10.1pp gains across three alternate model families on Terminal-Bench 2; ablations attribute gains to tools/middleware/memory rather than prompts.
Significance. If the central claims hold under rigorous controls, the work would be a meaningful advance in automating harness design for coding agents, a currently manual process. The transfer results without re-evolution and the localization of gains to structural components (rather than prose) suggest the method can produce reusable engineering knowledge. The decision-observability mechanism for turning edits into falsifiable contracts is a conceptual strength that could generalize beyond the reported benchmarks.
major comments (3)
- [Empirical evaluation] Empirical evaluation (results reporting the 7.3pp lift): the +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.
- [Ablation studies] Ablation studies: gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.
- [Transfer experiments] Transfer experiments: the SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens” with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.
minor comments (2)
- [Abstract and Methods] The abstract and methods could more explicitly define how an “iteration” is counted and what constitutes a single edit within the closed loop.
- [Introduction and Methodology] Notation for the three pillars would benefit from consistent acronym usage or a summary table to improve readability when referring back to them in later sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the empirical evaluation, ablation design, and transfer results. These comments highlight areas where we can improve rigor and clarity. We respond point by point below and commit to revisions.
- Referee: Empirical evaluation (results reporting the 7.3pp lift): the +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.
  Authors: We acknowledge that the primary results are reported from a single evolution run without error bars or significance tests. Each full AHE iteration incurs substantial compute for trajectory collection and evaluation across the benchmark, which constrained the initial experiments to one run. We will add error bars derived from repeated evaluations of the final harness, report results from one additional independent evolution run, and include a basic statistical comparison in the revised manuscript to address run-to-run variance. Revision: yes
- Referee: Ablation studies: gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.
  Authors: We agree that the current ablations, which remove individual pillars, do not fully isolate the contribution of the observability mechanisms from the mere act of performing edits. A control applying an equivalent number of edits without component, experience, and decision observability would strengthen the causal argument. We will add this baseline in the revision, comparing AHE-guided evolution against random or heuristic edits of matching volume, to demonstrate that the pillars are necessary for the observed gains. Revision: yes
- Referee: Transfer experiments: the SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens” with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.
  Authors: We will expand the transfer section to report the exact aggregate success rates on SWE-bench-verified for the seed and evolved harness, include variance or confidence intervals, and add a per-task breakdown table. This will quantify the improvement more precisely and better support the interpretation that the evolved components capture reusable engineering knowledge rather than benchmark-specific tuning. Revision: yes
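The "basic statistical comparison" mentioned in the first response is not specified. One common option, shown here only as an assumed illustration, is a paired percentile bootstrap over tasks for the pass@1 difference between the seed and evolved harness. The function name and the outcome lists are placeholders invented for this sketch and are not results from the paper. Resampling tasks captures evaluation uncertainty only; it does not measure variance across independent evolution runs, which still requires rerunning the loop.

```python
import random


def paired_bootstrap_delta(seed_pass, evolved_pass,
                           n_resamples=10_000, alpha=0.05, rng_seed=0):
    """Percentile bootstrap CI for pass@1(evolved) - pass@1(seed), resampling tasks."""
    if len(seed_pass) != len(evolved_pass) or not seed_pass:
        raise ValueError("need two non-empty outcome lists over the same tasks")
    n = len(seed_pass)
    rng = random.Random(rng_seed)
    # Per-task difference: +1 newly solved, -1 regressed, 0 unchanged.
    diffs = [int(e) - int(s) for s, e in zip(seed_pass, evolved_pass)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)   # resample tasks with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


# Synthetic placeholder outcomes for ten imaginary tasks (not paper data).
seed_outcomes = [True] * 6 + [False] * 4
evolved_outcomes = [True] * 8 + [False] * 2
delta, ci_low, ci_high = paired_bootstrap_delta(seed_outcomes, evolved_outcomes)
```

If the interval returned for real per-task outcomes excluded zero, the lift would be unlikely to be an artifact of which tasks happen to be in the benchmark. With a single evolution run, that is still weaker evidence than the referee asks for.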
Circularity Check
No significant circularity in the AHE derivation
The paper describes an empirical closed-loop evolution process driven by three observability pillars that convert edits into verifiable predictions against task outcomes. Performance claims rest on reported benchmark lifts (Terminal-Bench 2, SWE-bench-verified) and cross-model transfer rather than any equation or definition that reduces to its own inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or rename known results.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Harness components can be represented at file level in a way that makes the action space explicit and revertible.
- Domain assumption: Millions of raw trajectory tokens can be distilled into a layered evidence corpus that an evolving agent can consume effectively.
invented entities (3)
- Component observability pillar: no independent evidence
- Experience observability pillar: no independent evidence
- Decision observability pillar: no independent evidence
Forward citations
Cited by 3 Pith papers
- AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
  AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.
- Code as Agent Harness
  A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
- SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
  SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.