arxiv: 2605.22166 · v1 · pith:KWKXW7FGnew · submitted 2026-05-21 · 💻 cs.AI

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

Pith reviewed 2026-05-22 06:06 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · runtime harness · interface adaptation · deterministic environments · frozen models · trajectory interventions · agent benchmarks · transfer across models

The pith

A fixed runtime harness evolved from training failures improves frozen LLM agents across 18 models by adapting the interface rather than model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that many LLM agent failures in deterministic environments stem from mismatches at the model-environment interface rather than shortcomings in the model's knowledge or parameters. Life-Harness converts recurring failures observed in training trajectories into a fixed collection of reusable interventions that handle environment contracts, procedural skills, action realization, and trajectory regulation. These interventions stay unchanged during evaluation on held-out tasks. The approach delivers gains on 116 of 126 model-environment combinations across 18 backbones with an average 88.5 percent relative improvement, and a harness derived only from one small model transfers effectively to the rest. This positions runtime interface adaptation as a lightweight complement to model-centric training methods.

Core claim

Life-Harness evolves a lifecycle-aware runtime harness from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation. The harness remains fixed during held-out evaluation. When applied to frozen LLMs it improves 116 out of 126 model-environment settings across 18 backbones with an average relative improvement of 88.5 percent. Harnesses evolved solely from Qwen3-4B-Instruct trajectories transfer to 17 other models, indicating that the interventions capture reusable environment-side structure rather than model-specific behavior.

What carries the argument

Life-Harness, the fixed set of interventions derived from recurring failures in training trajectories that adjust observation, tool use, action execution, feedback interpretation, and trajectory control at runtime.
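
To make the idea of a fixed runtime harness around a frozen model concrete, here is a minimal sketch. It is not the paper's implementation: the `Harness` class, its fields, and `run_episode` are hypothetical names chosen for illustration, and the four fields only loosely mirror the four intervention categories named above (environment contracts, procedural skills, action realization, trajectory regulation).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical illustration only: none of these names come from the
# Life-Harness code release.


@dataclass
class Harness:
    contract_notes: List[str] = field(default_factory=list)  # environment contracts
    skills: List[str] = field(default_factory=list)  # procedural skills
    action_fixers: List[Callable[[str], str]] = field(default_factory=list)  # action realization
    max_repeats: int = 2  # trajectory regulation: allowed identical actions in a row

    def build_prompt(self, task: str, observation: str) -> str:
        # Prepend fixed guidance to what the frozen model sees.
        parts = self.contract_notes + self.skills
        parts += [f"Task: {task}", f"Observation: {observation}"]
        return "\n".join(parts)

    def realize(self, raw_action: str) -> str:
        # Normalize the model's raw output before the environment executes it.
        for fix in self.action_fixers:
            raw_action = fix(raw_action)
        return raw_action

    def should_interrupt(self, history: List[str]) -> bool:
        # Interrupt once the same action has been issued max_repeats + 1 times in a row.
        if len(history) <= self.max_repeats:
            return False
        tail = history[-(self.max_repeats + 1):]
        return len(set(tail)) == 1


def run_episode(
    model: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, bool]],
    harness: Harness,
    task: str,
    first_observation: str,
    max_steps: int = 10,
) -> List[str]:
    """Run one episode; the model and the harness are both left unchanged."""
    history: List[str] = []
    observation = first_observation
    for _ in range(max_steps):
        action = harness.realize(model(harness.build_prompt(task, observation)))
        history.append(action)
        if harness.should_interrupt(history):
            break
        observation, done = env_step(action)
        if done:
            break
    return history


# Toy usage: a "model" that always emits the same badly formatted action.
toy_model = lambda prompt: "  LOOK  "
toy_env = lambda action: ("nothing changes", False)
toy_harness = Harness(
    contract_notes=["Actions must be lowercase."],
    action_fixers=[str.strip, str.lower],
)
toy_history = run_episode(toy_model, toy_env, toy_harness, "find the key", "a dark room")
print(toy_history)  # ['look', 'look', 'look']
```

The point of the sketch is the division of labor: the model callable is never modified, and every adjustment lives in the harness object, which could in principle be reused with a different model callable.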

If this is right

  • Runtime interface adaptation can serve as a complement to model parameter updates for improving agents in rule-governed domains.
  • A single harness evolved from trajectories of one model can transfer to many other models without additional training.
  • Focusing on recurring failures in training data yields reusable fixes that improve held-out performance across multiple benchmarks and backbones.
  • Interface mismatches in deterministic environments can be addressed without changing model weights or evaluation setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the harness truly captures general environment structure, similar fixed interventions could be tested on additional deterministic tasks outside the original seven environments.
  • Developers might reduce per-model agent fine-tuning efforts if reusable harnesses handle common interface failures across deployments.
  • Some performance gaps attributed to model limitations in agent settings may instead reflect fixable interface design choices that can be handled separately from the model.

Load-bearing premise

Recurring interaction failures observed in training trajectories can be converted into a fixed set of reusable interventions that remain effective and non-overfitting on held-out evaluation trajectories without any further adaptation or selection during testing.

What would settle it

Finding that Life-Harness reduces or fails to improve performance on a new collection of held-out trajectories from the same environments, or that a harness evolved from one model provides no benefit when applied to other models.
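
A check of this kind amounts to a paired comparison per model-environment setting. The sketch below assumes each setting has one baseline score and one harnessed score; the `summarize` function, the `tol` parameter, and the scores in the example are made up for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

Setting = Tuple[str, str]  # (model name, environment name)


def summarize(
    results: Dict[Setting, Tuple[float, float]], tol: float = 0.0
) -> Dict[str, int]:
    """Count settings where the harnessed score beats, matches, or trails the baseline.

    `results` maps a setting to (baseline_score, harnessed_score).
    Differences within `tol` are counted as unchanged.
    """
    counts = {"improved": 0, "unchanged": 0, "worse": 0}
    for baseline, harnessed in results.values():
        delta = harnessed - baseline
        if delta > tol:
            counts["improved"] += 1
        elif delta < -tol:
            counts["worse"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Made-up scores, for illustration only.
example_results = {
    ("model-a", "env-1"): (0.20, 0.35),
    ("model-a", "env-2"): (0.50, 0.50),
    ("model-b", "env-1"): (0.40, 0.30),
}
example_counts = summarize(example_results)
print(example_counts)  # {'improved': 1, 'unchanged': 1, 'worse': 1}
```

If most new held-out settings landed in the "unchanged" or "worse" buckets, that would count against the premise.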

Figures

Figures reproduced from arXiv: 2605.22166 by Huifeng Wen, Meng Li, Tianshi Xu.

Figure 1: Adapting the runtime harness, not the model.
Figure 2: (a) An agent is not just an LLM: its behavior is shaped by the runtime harness that mediates observations, …
Figure 3: Failure diagnosis on training tasks.
Figure 4: Overview of Life-Harness. The harness adapts the model-environment interface through four lifecycle layers spanning before interaction, task conditioning, before environment execution, and after execution.
Figure 5: Absolute performance improvement across 18 …
Figure 6: Training set performance improves steadily as …
Figure 7: Comparison with prompt evolving method.
Figure 8: Comparison between specialized tool-use training and runtime harnessing. Harnessing can outperform …
Original abstract

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed during held-out evaluation. On seven deterministic environments from τ-bench, τ²-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at GitHub.
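
The headline counts in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below confirms that 18 backbones times seven environments is consistent with 126 settings. The `relative_improvement` definition is an assumption, since the page does not spell out how the paper computes or averages it, and the scores in the example are made up.

```python
# Counts taken from the abstract.
n_backbones = 18
n_environments = 7
n_improved = 116

n_settings = n_backbones * n_environments  # 126, matching the abstract
share_improved = n_improved / n_settings   # roughly 0.92


def relative_improvement(baseline: float, with_harness: float) -> float:
    """Assumed definition: (with_harness - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative improvement")
    return (with_harness - baseline) / baseline


print(n_settings)
print(f"{share_improved:.1%}")
# Made-up example: a baseline of 0.20 rising to 0.35 is a 75% relative gain.
print(f"{relative_improvement(0.20, 0.35):.0%}")
```

Relative gains are sensitive to low baselines, so an average of per-setting relative improvements can be dominated by a few settings where the baseline score was close to zero.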

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Life-Harness, a lifecycle-aware runtime harness for frozen LLM agents in deterministic environments. It evolves reusable interventions from training trajectories by converting recurring interaction failures across environment contracts, procedural skills, action realization, and trajectory regulation; the harness remains fixed at test time. The central empirical claim is that this yields improvements in 116 of 126 model-environment settings across 18 backbones (average 88.5% relative gain) and that harnesses derived solely from Qwen3-4B-Instruct trajectories transfer to 17 other models, demonstrating capture of environment-side structure rather than model-specific patterns.

Significance. If the results and transfer evidence hold after clarification of the intervention-construction pipeline, the work provides a concrete, reproducible alternative to model-centric adaptation for rule-governed agent domains. The cross-model transfer result and public code release are notable strengths that would support broader adoption of interface-level fixes.

major comments (2)
  1. [§3.2] §3.2 (Failure-to-Intervention Pipeline): The description of how recurring failures are detected and turned into fixed interventions does not explicitly state whether clustering or rule writing inspects model-generated token sequences or reasoning traces from the source trajectories. This detail is load-bearing for the transfer claim in the abstract and §5.3; without it, the 88.5% average improvement and cross-model results could partly reflect implicit model-specific patching rather than purely environment-side adaptation.
  2. [§4.2] §4.2 and Table 1: The 116/126 success count and per-setting relative improvements are reported without an ablation that isolates post-hoc selection or threshold tuning during harness construction. If any fitted rules or selection steps were applied after observing training trajectories, they must be shown to be environment-contract-only; otherwise the held-out evaluation gains risk over-attribution to the harness.
minor comments (2)
  1. Notation for environment contracts and intervention types is introduced without a compact summary table; a single table listing the four categories with one canonical example each would improve readability.
  2. The abstract states 'Code is available at GitHub' but the manuscript does not include the exact repository URL or commit hash; this should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on Life-Harness. The comments identify opportunities to strengthen the description of the intervention pipeline and to provide additional controls on the empirical results. We address each major comment below and indicate the revisions we will make.

  1. Referee: [§3.2] §3.2 (Failure-to-Intervention Pipeline): The description of how recurring failures are detected and turned into fixed interventions does not explicitly state whether clustering or rule writing inspects model-generated token sequences or reasoning traces from the source trajectories. This detail is load-bearing for the transfer claim in the abstract and §5.3; without it, the 88.5% average improvement and cross-model results could partly reflect implicit model-specific patching rather than purely environment-side adaptation.

    Authors: The failure-to-intervention pipeline in §3.2 operates exclusively on observable interaction traces consisting of environment observations, agent action strings, and resulting feedback signals as defined by the deterministic environment contracts. Clustering and rule formulation are performed on these environment-governed signals; no model-internal reasoning traces or full token sequences are inspected or used. This design ensures the resulting interventions address environment-contract, procedural, action-realization, and trajectory-regulation mismatches rather than model-specific patterns, which is consistent with the cross-model transfer results reported in §5.3. We will add an explicit statement of this scope to §3.2 in the revision.
    revision: yes

  2. Referee: [§4.2] §4.2 and Table 1: The 116/126 success count and per-setting relative improvements are reported without an ablation that isolates post-hoc selection or threshold tuning during harness construction. If any fitted rules or selection steps were applied after observing training trajectories, they must be shown to be environment-contract-only; otherwise the held-out evaluation gains risk over-attribution to the harness.

    Authors: Harness construction applies fixed, deterministic criteria based on the recurrence of failure categories across training trajectories and their alignment with the predefined environment contracts; no performance-based threshold tuning or post-hoc selection of rules occurs after observing the trajectories. All interventions are therefore environment-contract-only by construction. To further isolate this aspect, we will add an ablation in the revised §4.2 that removes the recurrence filter and reports the resulting performance on the same 126 settings, confirming that the reported gains derive from the contract-derived interventions rather than selection artifacts.
    revision: yes
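
The recurrence-based construction described in these simulated responses could look roughly like the following. This is a sketch under stated assumptions, not the paper's pipeline: the `Step` record, the keyword rules in `failure_category`, and the `min_trajectories` cutoff are all hypothetical. The only property it tries to mirror is that categorization reads environment-side fields (observation, action string, feedback) and never a model's reasoning trace.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    observation: str  # what the environment showed
    action: str       # the action string the agent submitted
    feedback: str     # what the environment returned


def failure_category(step: Step) -> Optional[str]:
    """Hypothetical keyword rules over environment feedback only."""
    feedback = step.feedback.lower()
    if "unknown tool" in feedback:
        return "action_realization"
    if "missing required" in feedback:
        return "environment_contract"
    return None


def recurring_categories(
    trajectories: List[List[Step]], min_trajectories: int
) -> List[str]:
    """Keep categories that appear in at least `min_trajectories` trajectories."""
    seen_in: Counter = Counter()
    for trajectory in trajectories:
        categories = {failure_category(step) for step in trajectory}
        categories.discard(None)
        seen_in.update(categories)  # each trajectory counts once per category
    return sorted(c for c, n in seen_in.items() if n >= min_trajectories)


# Made-up trajectories, for illustration only.
example_trajectories = [
    [Step("menu", "serch(x)", "Error: unknown tool 'serch'")],
    [Step("menu", "lookup(x)", "Error: unknown tool 'lookup'"),
     Step("form", "book()", "Error: missing required field 'date'")],
    [Step("menu", "search(x)", "ok")],
]
example_recurring = recurring_categories(example_trajectories, min_trajectories=2)
print(example_recurring)  # ['action_realization']
```

Counting each category once per trajectory, rather than once per step, keeps a single long failing episode from looking like a recurring pattern.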

Circularity Check

0 steps flagged

No significant circularity: harness construction and transfer results remain empirically grounded

The paper constructs Life-Harness by converting recurring failures observed in training trajectories into a fixed set of reusable interventions, then evaluates the frozen harness on held-out trajectories and across 17 other models. No equations, fitted parameters, or self-citations are shown that reduce the reported 88.5% average improvement or cross-model transfer to the training inputs by construction. The central claim rests on empirical measurement of environment-side structure captured in the interventions, with the transfer evidence serving as an external check rather than a self-referential loop. The method is therefore self-contained against the stated evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that failure patterns in training trajectories encode reusable environment-side structure that generalizes to held-out settings; no explicit free parameters, axioms, or invented entities are stated in the abstract.
