Natural-Language Agent Harnesses

Hai-Tao Zheng; Jingchen Ni; Lexiao Zou; Linyue Pan; Shuo Guo

arxiv: 2603.25723 · v2 · pith:FN2I4DW2new · submitted 2026-03-26 · 💻 cs.CL · cs.AI

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

Pith reviewed 2026-05-21 10:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: natural language agent harnesses · intelligent harness runtime · agent execution policies · benchmark comparisons · module ablations · coding benchmarks · computer use agents

The pith

Natural-language documents can serve as executable harnesses for agents and match code-based performance on benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes representing the external execution logic around AI agents as editable natural-language documents rather than buried code. These Natural-Language Agent Harnesses are interpreted by a shared runtime that manages agent calls, state, validations, and artifacts. On benchmarks for coding, terminal use, and computer use, this approach matches the task success of traditional code and prompt-based harnesses. It also produces much shorter static policies that are easier to inspect and modify. Ablation studies demonstrate that individual harness modules can be analyzed separately.

Core claim

The paper claims that agent harness policies can be expressed as natural-language objects and executed reliably by an interpreter runtime, achieving similar outcomes to code implementations on standard benchmarks while making the policies shorter and more transparent for scientific use.

What carries the argument

Natural-Language Agent Harnesses (NLAHs) as editable documents describing run-level policy, paired with the Intelligent Harness Runtime (IHR) that translates them into agent calls, handoffs, state updates, validation gates, and artifact contracts.
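
To make this division of labor concrete, here is a minimal Python sketch of a policy object interpreted by a shared runtime. It is not the paper's IHR: the real system is described as interpreting free-form natural language, whereas this toy swaps in a hypothetical list of structured steps so that the control flow (agent calls, state updates, validation gates) is visible. Every name here (`Step`, `run_harness`, the stub agents) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]


@dataclass
class Step:
    """One harness step: which agent to call and what its output must satisfy."""
    agent: str                                     # agent to hand off to
    writes: str                                    # state key (artifact) the step produces
    gate: Optional[Callable[[str], bool]] = None   # validation gate on the artifact
    max_attempts: int = 2                          # attempts before the run aborts


def run_harness(steps: List[Step],
                agents: Dict[str, Callable[[State], str]]) -> Tuple[State, List[str]]:
    """Interpret the policy: call agents in order, update state, enforce gates."""
    state: State = {}
    trace: List[str] = []
    for step in steps:
        for attempt in range(1, step.max_attempts + 1):
            artifact = agents[step.agent](state)            # agent call / handoff
            ok = step.gate is None or step.gate(artifact)   # validation gate
            trace.append(f"{step.agent}#{attempt}:{'pass' if ok else 'fail'}")
            if ok:
                state[step.writes] = artifact               # state update
                break
        else:
            trace.append("abort")                           # gate never passed
            return state, trace
    return state, trace


# Hypothetical stub agents standing in for model calls.
def planner(state: State) -> str:
    return "plan: write add()"


def coder(state: State) -> str:
    # Fails the gate once, then succeeds, to exercise the retry path.
    coder.calls = getattr(coder, "calls", 0) + 1
    return "def add(a, b): return a + b" if coder.calls > 1 else "TODO"


policy = [
    Step(agent="planner", writes="plan"),
    Step(agent="coder", writes="patch", gate=lambda s: s.startswith("def ")),
]
final_state, trace = run_harness(policy, {"planner": planner, "coder": coder})
print(trace)  # ['planner#1:pass', 'coder#1:fail', 'coder#2:pass']
```

Because the policy is plain data, removing a gate or changing `max_attempts` is an edit to `policy` rather than to `run_harness`. The paper attributes this same property to NLAHs, although the sketch says nothing about how reliably natural language can be interpreted.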

If this is right

Harnesses become easier to inspect, compare, and transfer without reading controller code.
Ablations can isolate the contribution of individual harness modules such as validation gates.
Editing harness behavior requires only changing text in a document rather than reprogramming.
Static policies become shorter and more readable while preserving task outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Community sharing of agent execution logic could shift from code snippets to reusable text documents.
Non-programmers might design and refine agent flows by editing natural-language descriptions.
Standardized harness formats could improve reproducibility across different agent research projects.

Load-bearing premise

Natural-language descriptions of harness policies can be interpreted by the runtime with enough precision and consistency to match the functional behavior of code-based harnesses without introducing new ambiguities.

What would settle it

A side-by-side execution of the same task where an NLAH version fails to replicate a specific handoff, validation gate, or state update that the code version handles correctly.
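
A check of this kind could be sketched as a trace comparison. The Python below assumes, hypothetically, that both the code harness and the NLAH run can be logged as ordered lists of `(kind, detail)` events; the paper does not specify any such log format. The traces are invented for illustration and are not from the paper.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (kind, detail), e.g. ("gate", "tests:pass")


def first_divergence(code_trace: List[Event],
                     nlah_trace: List[Event]) -> Optional[dict]:
    """Return the first position where two event traces differ, or None if identical."""
    # zip_longest pads the shorter trace with None, so a missing event also counts.
    for i, (a, b) in enumerate(zip_longest(code_trace, nlah_trace)):
        if a != b:
            return {"index": i, "code": a, "nlah": b}
    return None


# Hypothetical traces, invented for illustration only.
code_run = [("call", "coder"), ("gate", "tests:fail"),
            ("call", "coder"), ("gate", "tests:pass")]
nlah_run = [("call", "coder"), ("gate", "tests:fail"),
            ("handoff", "reviewer")]

print(first_divergence(code_run, nlah_run))
# {'index': 2, 'code': ('call', 'coder'), 'nlah': ('handoff', 'reviewer')}
```

A single divergence like this one, where the retry after a failed gate is replaced by a handoff, would be the kind of counterexample described above. Identical traces on a handful of tasks would not establish equivalence in general.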

Original abstract

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

NLAHs frame harness policies as editable natural-language documents run by a shared interpreter, which could improve inspectability, but the abstract supplies no metrics or examples to confirm functional equivalence holds.

The main point is that this work treats the harness around an agent as a reusable natural-language document instead of controller code, with a runtime that turns the document into calls, state updates, and gates. That setup aims to make policies shorter, easier to edit, and more comparable across coding, terminal, and computer-use tasks while keeping outcomes similar to code or prompted versions. The idea of turning incidental glue into something that can be ablated and inspected as a scientific object is a reasonable response to a real practical issue in current agent systems. If the experiments show clean module ablations and shorter policies without performance loss, the framing could be useful for people who need to transfer or debug harness logic.

The abstract does not give numbers, error bars, benchmark names, or side-by-side traces, so it is hard to judge how well the natural-language version actually matches code behavior. Natural language leaves room for different readings on sequencing, error handling, and termination, and without concrete policy examples or resolution rules it is unclear whether the runtime avoids new ambiguities or hidden dependencies. The stress-test concern about interpretation consistency looks like it still applies based on what is shown.

This is the kind of paper that would interest people already working on agent harnesses and execution environments. A reader who wants concrete ways to make agent logic more transparent might get something out of it if the full results section has the missing details and controls. I would send it to peer review so referees can check the data and examples directly rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Natural-Language Agent Harnesses (NLAHs) as editable natural-language documents that specify run-level harness policies for agents, along with an Intelligent Harness Runtime (IHR) that interprets these documents to produce agent calls, state updates, validation gates, and artifact contracts. The central empirical claim is that IHR-executed NLAHs yield task outcomes comparable to those from traditional code-based and prompted harness realizations across coding, terminal-use, and computer-use benchmarks, while exposing substantially shorter static harness policies; module ablations are presented to show that explicit harness components remain analyzable.

Significance. If the quantitative claims are substantiated, the work offers a concrete route to converting agent harness logic from opaque, tightly coupled controller code into inspectable, transferable, and ablatable scientific objects. This could improve reproducibility, systematic comparison, and ablation studies in agent systems by making the surrounding execution policy an explicit, human-readable artifact rather than incidental glue.

major comments (2)

[Abstract and §4] Abstract and experimental results section: the assertion of 'comparable task outcomes' to code and prompted baselines is presented without any reported success rates, error bars, benchmark identifiers, statistical tests, or ablation tables. This absence prevents evaluation of whether the claimed parity holds or whether differences fall within experimental noise, directly undermining the load-bearing empirical claim.
[§3] §3 (IHR runtime description): no concrete policy-document examples, resolution rules for sequencing/error-handling/termination, or side-by-side execution traces are supplied to demonstrate that natural-language interpretation produces state transitions and validation behavior functionally equivalent to the corresponding code harnesses. Given that the skeptic concern centers on hidden ambiguities in NL resolution, this omission leaves the functional-equivalence assumption unverified.
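
As a rough illustration of the statistical check the first major comment asks for, the sketch below runs a two-sided two-proportion z-test on success counts from two harness variants. The counts are hypothetical and are not results from the paper; the function name and the choice of test are assumptions made for this example.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference between two success rates (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both arms all-success or all-failure: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts, NOT results from the paper.
z, p = two_proportion_z_test(success_a=62, n_a=100, success_b=58, n_b=100)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Two caveats apply. A large p-value only fails to detect a difference; supporting a parity claim would need an equivalence test with a pre-stated margin. And if both harnesses are run on the same tasks, the samples are paired, so a paired test such as McNemar's would be the more appropriate choice.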

minor comments (2)

[Introduction] Notation for 'NLAH' and 'IHR' is introduced in the abstract but lacks an explicit first-use definition or glossary entry in the main text.
[Figures] Figure captions for any harness-policy examples or ablation diagrams should explicitly label axes, units, and comparison conditions for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to make the empirical results and runtime mechanics more transparent and verifiable. We respond to each major comment below and commit to revisions that directly address the concerns raised.

Referee: [Abstract and §4] Abstract and experimental results section: the assertion of 'comparable task outcomes' to code and prompted baselines is presented without any reported success rates, error bars, benchmark identifiers, statistical tests, or ablation tables. This absence prevents evaluation of whether the claimed parity holds or whether differences fall within experimental noise, directly undermining the load-bearing empirical claim.

Authors: We agree that the current high-level phrasing in the abstract and §4 does not supply the quantitative detail needed to evaluate the parity claim rigorously. The manuscript states that IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations across the three benchmark categories, but does not include the per-benchmark success rates, variance measures, or statistical comparisons. In the revision we will expand §4 with explicit success rates, standard deviations (or error bars) from repeated runs, the precise benchmark identifiers employed, and statistical tests assessing whether observed differences fall within experimental noise. The abstract will be updated to reference these concrete results.

Revision: yes

Referee: [§3] §3 (IHR runtime description): no concrete policy-document examples, resolution rules for sequencing/error-handling/termination, or side-by-side execution traces are supplied to demonstrate that natural-language interpretation produces state transitions and validation behavior functionally equivalent to the corresponding code harnesses. Given that the skeptic concern centers on hidden ambiguities in NL resolution, this omission leaves the functional-equivalence assumption unverified.

Authors: We accept that concrete illustrations are required to substantiate functional equivalence and to mitigate concerns about hidden ambiguities in natural-language resolution. Although §3 describes how the IHR maps NLAH documents to agent calls, state updates, validation gates, and artifact contracts, it does not provide worked examples or traces. In the revision we will insert (i) representative NLAH policy documents, (ii) explicit resolution rules for sequencing, error handling, and termination, and (iii) side-by-side execution traces that compare the state transitions and validation outcomes produced by the natural-language interpreter against the corresponding code harnesses.

Revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on benchmark comparisons, not self-referential derivations

The paper introduces NLAHs as editable natural-language documents and IHR as an interpreter that maps them to agent calls, state updates, and validation gates. It then reports empirical results across coding, terminal-use, and computer-use benchmarks showing comparable task outcomes to code-based and prompted baselines, plus shorter policies and analyzable modules. No equations, fitted parameters, or derivation steps appear in the abstract or described structure. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims are therefore grounded in external experimental comparisons rather than reducing to the inputs by construction, satisfying the self-contained criterion for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that natural language can encode harness policies precisely enough for reliable execution and that the proposed runtime can interpret them without loss of capability.

axioms (1)

domain assumption: Natural-language descriptions can capture complex harness policies with sufficient precision and without ambiguity for runtime interpretation.
This premise is required for NLAHs to function as claimed replacements for code-based harnesses.

invented entities (2)

Natural-Language Agent Harness (NLAH) · no independent evidence
purpose: To serve as an editable natural-language representation of run-level harness policy.
New representation object introduced to replace code-based harnesses.

Intelligent Harness Runtime (IHR) · no independent evidence
purpose: To interpret NLAH documents into agent calls, state updates, and validation gates.
New execution engine proposed to operationalize the natural-language harnesses.

pith-pipeline@v0.9.0 · 5701 in / 1336 out tokens · 62324 ms · 2026-05-21T10:14:55.683670+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ (AbsoluteFloorClosure, AlexanderDuality, Cost/FunctionalEquation) reality_from_one_distinction; washburn_uniqueness_aczel · unclear

We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts.

IndisputableMonolith/Cost Jcost uniqueness · unclear

Module ablations further show that explicit harness modules are analyzable.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
cs.AI · 2026-05 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
cs.AI · 2026-05 · unverdicted · novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

Agentic MIP Research: Accelerated Constraint Handler Generation
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
cs.CL · 2026-05 · unverdicted · novelty 7.0

OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI · 2026-04 · unverdicted · novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
stat.CO · 2026-05 · unverdicted · novelty 6.0

AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.

Harnesses for Inference-Time Alignment over Execution Trajectories
cs.LG · 2026-05 · unverdicted · novelty 6.0

Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal be...

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
cs.CL · 2026-04 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI · 2026-04 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Code as Agent Harness
cs.CL · 2026-05 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

Harness Engineering as Categorical Architecture
cs.PL · 2026-05 · unverdicted · novelty 5.0

Categorical Architecture triple (G, Know, Phi) supplies the formal theory for composing LLM agent harnesses with structurally preserved certificates.

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
cs.AI · 2026-05 · unverdicted · novelty 5.0

A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.

Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO · 2026-05 · unverdicted · novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI · 2026-05 · conditional · novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.