It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Yong-eun Cho

arxiv: 2605.26731 · v1 · pith:XPOPNTEVnew · submitted 2026-05-26 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-29 17:51 UTC · model grok-4.3

keywords LLM agents · harness complexity · capability tiers · non-monotone · VTSR · failure taxonomy · agent reliability · HEAT-24

The pith

Harness sensitivity in LLM agents is non-monotone across capability tiers, breaking the expected inverse link to model power.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A common assumption holds that more structured harnesses always raise reliability and that higher-capability models need less of this structure, implying a monotone inverse relationship between tier and optimal harness complexity. The paper tests the assumption in a 432-run experiment that crosses six models from four tiers with three harness conditions on a git-verified 24-task benchmark. Results show the assumption fails in opposite directions: a frontier chat model loses 29-38 percentage points of verified task success rate (VTSR) under more verbose harnesses, while a frontier reasoning model reaches its highest success rate and lowest latency under the strict condition. Even a 2B constrained-tier model matches strong-open-tier stability across all harnesses. The findings indicate that harness effects depend on model type rather than capability tier alone.
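The run count and the headline percentage can be sanity-checked with simple arithmetic. The sketch below assumes exactly one run per (model, harness, task) cell; that is an inference from 6 × 3 × 24 = 432, not something the summary states. Under that assumption each model-harness cell holds 24 runs, so VTSR can only move in steps of 1/24, and 22 of 24 is one count that yields the reported 91.7%.

```python
# Sanity check of the design size, assuming (not stated in the source)
# exactly one run per (model, harness, task) cell.
MODELS = 6
HARNESSES = 3   # light, balanced, strict
TASKS = 24      # HEAT-24

total_runs = MODELS * HARNESSES * TASKS
runs_per_cell = total_runs // (MODELS * HARNESSES)


def vtsr_percent(successes: int, runs: int) -> float:
    """Verified task success rate as a percentage, rounded to one decimal."""
    return round(100.0 * successes / runs, 1)


print(total_runs)            # 432
print(runs_per_cell)         # 24
print(vtsr_percent(22, 24))  # 91.7
```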

Core claim

The paper establishes that the hypothesized monotone inverse relationship between model capability tier and optimal harness complexity does not hold. For the frontier chat model, increased harness verbosity lowers verified task success rate by 29-38 percentage points. For the frontier reasoning model with extended thinking, the strict harness produces the highest VTSR of 91.7 percent together with the lowest latency. A 2B model achieves 91.7 percent stability across every harness level, matching stronger tiers. Harness sensitivity is therefore non-monotone and tied to whether the model is chat-oriented or reasoning-oriented.

What carries the argument

The controlled crossing of three harness conditions (light, balanced, strict) with models from four capability tiers, measured by verified task success rate on the HEAT-24 benchmark.
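A minimal sketch of how such a crossing might be aggregated into a per-cell VTSR table. The `Run` record layout, the placeholder model names, and the toy data are all hypothetical; the paper's actual log schema and verification code are not shown in this summary.

```python
from collections import defaultdict
from typing import NamedTuple


class Run(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    model: str
    harness: str      # "light", "balanced", or "strict"
    task_id: int
    verified: bool    # True if the git-based workspace check passed


def vtsr_by_cell(runs):
    """Return {(model, harness): fraction of runs that were verified}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in runs:
        key = (r.model, r.harness)
        total[key] += 1
        passed[key] += int(r.verified)
    return {key: passed[key] / total[key] for key in total}


# Toy data only: placeholder models, not results from the paper.
toy_runs = [
    Run("model-a", "light", 1, True),
    Run("model-a", "light", 2, True),
    Run("model-a", "strict", 1, True),
    Run("model-a", "strict", 2, False),
    Run("model-b", "strict", 1, True),
]
table = vtsr_by_cell(toy_runs)
print(table[("model-a", "light")])   # 1.0
print(table[("model-a", "strict")])  # 0.5
```

Keying the table on the (model, harness) pair rather than on tier keeps the comparison at the level the abstract says the evidence supports.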

If this is right

Frontier chat models can lose substantial performance when harness verbosity increases.
Frontier reasoning models achieve both higher success and lower latency under strict harness conditions.
Constrained-tier models can maintain high stability across all harness levels.
Failure modes shift from format violations in capable models to wrong-file errors in low-capability models.
Practical harness selection rules can be stated in terms of model type and tier.
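The failure-mode shift in the list above amounts to asking which label is most frequent in each group of failed runs. The sketch below shows one way to compute that. It uses only the two labels the abstract names (`format_violation` and `wrong_file`); the other four labels of the taxonomy are not given in this summary, and the counts are invented for illustration.

```python
from collections import Counter


def dominant_failure(labels):
    """Return the most frequent failure label, or None if there were no failures."""
    counts = Counter(labels)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy label lists only; these counts are not from the paper.
capable_failures = ["format_violation", "format_violation", "wrong_file"]
low_capability_failures = ["wrong_file", "wrong_file", "format_violation"]

print(dominant_failure(capable_failures))         # format_violation
print(dominant_failure(low_capability_failures))  # wrong_file
```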

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent deployment pipelines may need separate harness templates for chat versus reasoning models even at similar capability levels.
The non-monotonic pattern could be tested on additional benchmarks that use different verification methods.
Component-level ablation of harness elements might isolate which parts drive the opposing effects in chat and reasoning models.

Load-bearing premise

The single model chosen for each capability tier and the three defined harness conditions are sufficient to show that harness sensitivity is non-monotone in general rather than only for these specific models and tasks.

What would settle it

A larger experiment that samples multiple models per tier and finds a consistent drop in required harness complexity as capability rises would falsify the non-monotone claim.
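The falsification condition can be phrased as a simple ordering check: with tiers listed from lowest to highest capability, the best-performing harness should never become more complex. The sketch below assumes a ranking of light < balanced < strict and uses hypothetical orderings, not the paper's measurements.

```python
# Assumed complexity ranking of the three harness conditions.
COMPLEXITY = {"light": 0, "balanced": 1, "strict": 2}


def is_monotone_inverse(best_harness_by_tier):
    """True if optimal harness complexity never increases as capability rises.

    `best_harness_by_tier` lists the best-performing harness per tier,
    ordered from lowest to highest capability.
    """
    ranks = [COMPLEXITY[h] for h in best_harness_by_tier]
    return all(later <= earlier for earlier, later in zip(ranks, ranks[1:]))


# Hypothetical orderings for illustration.
print(is_monotone_inverse(["strict", "balanced", "balanced", "light"]))  # True
print(is_monotone_inverse(["balanced", "strict", "light", "strict"]))    # False
```

With several models per tier, the same check could be run on each sampled combination to see how often the monotone pattern survives.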

Figures

Figures reproduced from arXiv: 2605.26731 by Yong-eun Cho.

**Figure 2.** Category-level VTSR heatmaps for frontier/strong-open (left) and constrained (right) tiers. Green = high VTSR; red = low. … across all harnesses, with format violation rising under balanced and strict. The near-flat performance curve (≤21% in any condition) indicates that this model lacks the baseline instruction-following capability required to benefit from structural harness guidance. The slight strict…

**Figure 3.** Failure label distribution by tier and har…

Original abstract

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives concrete counterexamples to monotone harness assumptions but only for the two frontier models tested, with the design limiting any tier-level claim.

The punchline is that this work ran 432 controlled runs on the HEAT-24 benchmark and found opposite harness effects in the two models it actually tested: Gemini 2.5 Flash lost 29-38 points with stricter harnesses, while Qwen3.5-122B gained with the strict version and also ran fastest. That is a direct empirical pushback against the idea that higher capability always pairs with lighter structure.

What stands out as new is the six-label failure taxonomy, which separates format violations from wrong-file errors, and the practical guidelines that follow from the patterns. The experiment crosses three harness levels with the models in a single setup, which lets the paper report the VTSR and latency differences cleanly. The abstract itself also flags the model-specific nature of the results, which keeps the claims grounded.

The main limitation is exactly what the stress-test note flags: one model per tier. The abstract states the findings are model-specific observations rather than a general tier result, and the design does not include enough within-tier variation to support the broader title phrasing. The benchmark is synthetic and git-based, so transfer to messier real tasks remains open. No error bars or per-condition variance numbers appear in the abstract, which makes the percentage-point claims harder to weigh.
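To illustrate why missing error bars matter at this sample size, here is a sketch of a Wilson score interval for a single cell. It assumes 24 runs per model-harness cell and 22 verified successes, which is consistent with the reported 91.7% but is not confirmed by the abstract; the paper may use a different uncertainty method or none.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 22 verified successes out of 24 runs.
low, high = wilson_interval(22, 24)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 0.74 to 0.98, wide enough that differences of a few tasks between conditions would be hard to distinguish from noise without repeated runs.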

This paper is for practitioners who choose harnesses for deployed agents and for evaluators who want a simple taxonomy to track failure modes. A reader running similar controlled comparisons would find the setup and the taxonomy useful even if they disagree with how far the results generalize.

It deserves peer review. The experiment is reproducible in principle, the refutation is stated with numbers, and the authors already qualify their scope. A referee could push on the single-model issue and the benchmark choice without the paper falling apart.

Referee Report

1 major / 2 minor

Summary. The paper claims that the assumption of a monotone inverse relationship between LLM capability tier and optimal harness complexity is refuted by a 432-run controlled experiment crossing six models in four tiers with three harness conditions (light, balanced, strict) on the HEAT-24 benchmark. Key results include a 29-38 percentage point VTSR drop for Gemini 2.5 Flash under stricter harnesses and 91.7% VTSR with lowest latency for Qwen3.5-122B under strict harness; a six-label failure taxonomy is introduced showing format_violation vs. wrong_file dominance by capability, along with tier-aware guidelines. The abstract qualifies findings as model-specific observations dependent on model type (chat vs. reasoning).

Significance. If the non-monotonic patterns hold, the work would usefully challenge a common deployment heuristic and support model-type-specific harness choices. The controlled design with explicit run counts (432) and git-based verification on a synthetic benchmark provides a reproducible empirical foundation for the specific model observations reported.

major comments (1)

[Abstract] The title and central claim that harness sensitivity 'is non-monotone across LLM Agent Tiers' (refuting the 'monotone inverse relationship' assumption) rests on a single-model-per-tier design. Although the abstract correctly qualifies results as 'model-specific observations' and notes dependence on 'model type (chat vs. reasoning)', this design choice prevents generalizing the observed patterns to capability tiers in general, which is load-bearing for the refutation.

minor comments (2)

[Abstract] The reported VTSR values (e.g., 91.7%) and percentage-point differences lack any reference to error bars, per-task variance, or statistical tests; if these appear in the methods or results sections, cross-reference them from the abstract to support the quantitative claims.
The description 'crossing six models across four capability tiers' should be clarified with the exact per-tier model counts to resolve any ambiguity with the single-model-per-tier limitation discussion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this precise observation on the scope of our claims. We agree that the single-model-per-tier design precludes generalizing patterns to capability tiers as a whole and that this is material to the strength of the refutation. We will revise the manuscript to align the title, abstract, and discussion more tightly with the model-specific nature of the results.

Referee: [Abstract] The title and central claim that harness sensitivity 'is non-monotone across LLM Agent Tiers' (refuting the 'monotone inverse relationship' assumption) rests on a single-model-per-tier design. Although the abstract correctly qualifies results as 'model-specific observations' and notes dependence on 'model type (chat vs. reasoning)', this design choice prevents generalizing the observed patterns to capability tiers in general, which is load-bearing for the refutation.

Authors: We accept the critique. Although the abstract already states that results are model-specific observations, the title and framing of the central claim use 'tiers' in a way that could imply broader generalization. We will (1) revise the title to 'It's Not the Capability: Harness Sensitivity Is Non-Monotone Across Evaluated LLM Agent Models', (2) add an explicit sentence in the abstract and Section 1 stating that tier-level generalization would require multiple models per tier, and (3) adjust the discussion to frame the findings as counterexamples to the universal assumption rather than a direct refutation at the tier level. These changes preserve the empirical contribution while removing any overstatement.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical controlled experiment

The paper reports results from a 432-run controlled experiment crossing six models, three harness conditions, and the HEAT-24 benchmark. It measures VTSR and latency directly from runs and presents model-specific observations without any equations, fitted parameters renamed as predictions, or derivations. The text explicitly qualifies results as model-specific and notes dependence on model type. No self-citations, uniqueness theorems, or ansatzes appear in the provided text; the central claim rests on the experimental data rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the HEAT-24 benchmark as a proxy for real agent reliability and on the chosen models being representative enough to demonstrate non-monotonicity; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The VTSR metric on HEAT-24 with git-based verification measures agent reliability comparably across harness conditions and model tiers.
Invoked when interpreting VTSR differences as evidence against the monotone hypothesis.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 2 internal anchors

[1]

Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori

Decoupling task-solving and output formatting in LLM generation. arXiv preprint arXiv:2510.03595. Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori

work page arXiv
[2]

MD Azizul Hakim

JSONSchemaBench: A rigorous benchmark of structured outputs for language models. arXiv preprint arXiv:2501.10868. MD Azizul Hakim

work page arXiv
[3]

Carlos E

Brevity constraints reverse performance hierarchies in language models. arXiv preprint arXiv:2604.00025. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

work page arXiv
[4]

You don’t need prompt engineering anymore: The prompting inversion. arXiv preprint arXiv:2510.22251. Yin Li

work page arXiv
[5]

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, and 1 others

Decomposing LLM self-correction: The accuracy-correction paradox and error depth hypothesis. arXiv preprint arXiv:2601.00828. Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, and 1 others

work page arXiv
[6]

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

The prompt report: A systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608. Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Reflexion: Language Agents with Verbal Reinforcement Learning

Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou

work page internal anchor Pith review Pith/arXiv arXiv
