Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Pith reviewed 2026-05-22 06:06 UTC · model grok-4.3
The pith
A fixed runtime harness evolved from training failures improves frozen LLM agents across 18 models by adapting the interface rather than model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Life-Harness evolves a lifecycle-aware runtime harness from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation. The harness remains fixed during held-out evaluation. When applied to frozen LLMs it improves 116 out of 126 model-environment settings across 18 backbones with an average relative improvement of 88.5 percent. Harnesses evolved solely from Qwen3-4B-Instruct trajectories transfer to 17 other models, indicating that the interventions capture reusable environment-side structure rather than model-specific behavior.
What carries the argument
Life-Harness is the fixed set of interventions, derived from recurring failures in training trajectories, that adjusts observation, tool use, action execution, feedback interpretation, and trajectory control at runtime.
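To make the idea of a fixed harness around a frozen model concrete, here is a minimal sketch. It is not the authors' implementation: the `Harness` class, its four hook names, and the toy agent and environment are all hypothetical, and the mapping of hooks to the paper's four intervention categories is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical names throughout; none are taken from the Life-Harness code.
History = List[Tuple[str, str]]  # (observation, realized action) pairs


@dataclass(frozen=True)  # frozen: the harness cannot be mutated during evaluation
class Harness:
    rewrite_observation: Callable[[str], str]  # assumed: environment contracts
    skill_hint: Callable[[str], str]           # assumed: procedural skills
    realize_action: Callable[[str], str]       # assumed: action realization
    should_stop: Callable[[History], bool]     # assumed: trajectory regulation


def run_episode(agent, env_step, first_obs, harness, max_steps=10):
    """Run one episode; the agent (model) is only ever called, never updated."""
    history: History = []
    obs = first_obs
    for _ in range(max_steps):
        shown = harness.rewrite_observation(obs)
        hint = harness.skill_hint(obs)
        prompt = shown if not hint else f"{shown}\n[hint] {hint}"
        action = harness.realize_action(agent(prompt))
        history.append((obs, action))
        if harness.should_stop(history):
            break
        obs, done = env_step(action)
        if done:
            break
    return history


def toy_agent(prompt: str) -> str:
    # Stand-in for a frozen LLM that emits a slightly malformed action.
    return " Open Door. "


def toy_env_step(action: str):
    # Deterministic environment: accepts only the exact string "open door".
    if action == "open door":
        return "door is open", True
    return "error: unknown command", False


harness = Harness(
    rewrite_observation=lambda o: o.strip(),
    skill_hint=lambda o: "use lowercase, no punctuation" if o.startswith("error") else "",
    realize_action=lambda a: a.strip().rstrip(".").lower(),
    should_stop=lambda h: len(h) >= 3 and len({a for _, a in h[-3:]}) == 1,
)
identity = Harness(lambda o: o, lambda o: "", lambda a: a, lambda h: False)

history = run_episode(toy_agent, toy_env_step, "you face a door", harness)
baseline = run_episode(toy_agent, toy_env_step, "you face a door", identity)
```

In this toy setup the same agent succeeds in one step with the harness and exhausts its step budget without it, because the only change is how its action string is realized for the environment. The same `harness` object could be passed unchanged to a different `agent` callable, which is the shape of the transfer experiment the paper reports.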
If this is right
- Runtime interface adaptation can serve as a complement to model parameter updates for improving agents in rule-governed domains.
- A single harness evolved from trajectories of one model can transfer to many other models without additional training.
- Focusing on recurring failures in training data yields reusable fixes that improve held-out performance across multiple benchmarks and backbones.
- Interface mismatches in deterministic environments can be addressed without changing model weights or evaluation setups.
Where Pith is reading between the lines
- If the harness truly captures general environment structure, similar fixed interventions could be tested on additional deterministic tasks outside the original seven environments.
- Developers might reduce per-model agent fine-tuning efforts if reusable harnesses handle common interface failures across deployments.
- Some performance gaps attributed to model limitations in agent settings may instead reflect fixable interface design choices that can be handled separately from the model.
Load-bearing premise
Recurring interaction failures observed in training trajectories can be converted into a fixed set of reusable interventions that remain effective and non-overfitting on held-out evaluation trajectories without any further adaptation or selection during testing.
What would settle it
The claim would be undermined if Life-Harness reduced performance or showed no improvement on a new collection of held-out trajectories from the same environments, or if a harness evolved from one model's trajectories provided no benefit when applied to other models.
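A check of this kind could be tallied along the following lines. This is a sketch under stated assumptions: the `summarize` function and the setting keys are hypothetical, the success rates in the example are made up, and relative gain is assumed to mean (harnessed - baseline) / baseline, which the page does not define.

```python
from typing import Dict, Tuple

Setting = Tuple[str, str]  # hypothetical key: (model name, environment name)


def summarize(baseline: Dict[Setting, float], harnessed: Dict[Setting, float]):
    """Count improved, unchanged, and worse settings and average the relative gain."""
    improved = unchanged = worse = 0
    rel_gains = []
    for key, base in baseline.items():
        new = harnessed[key]
        if new > base:
            improved += 1
        elif new < base:
            worse += 1
        else:
            unchanged += 1
        if base > 0:  # relative gain is undefined for a zero baseline
            rel_gains.append((new - base) / base)
    mean_rel = sum(rel_gains) / len(rel_gains) if rel_gains else float("nan")
    return {
        "improved": improved,
        "unchanged": unchanged,
        "worse": worse,
        "mean_relative_gain": mean_rel,
    }


# Made-up success rates, for illustration only.
base = {("m1", "e1"): 0.2, ("m1", "e2"): 0.5, ("m2", "e1"): 0.4}
with_harness = {("m1", "e1"): 0.3, ("m1", "e2"): 0.5, ("m2", "e1"): 0.3}
report = summarize(base, with_harness)
```

A result dominated by `unchanged` and `worse` settings on fresh held-out trajectories, or on models other than the source model, would count against the claim.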
Original abstract
LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed during held-out evaluation. On seven deterministic environments from $\tau$-bench, $\tau^2$-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at GitHub.
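The headline counts can be checked with a line of arithmetic. Assuming each of the 18 backbones is evaluated on each of the seven environments (the abstract implies this but does not state it outright), the totals work out as follows:

```latex
\begin{align*}
\text{settings} &= 18 \text{ backbones} \times 7 \text{ environments} = 126,\\
\text{improved fraction} &= \frac{116}{126} \approx 0.921,\\
\text{settings not improved} &= 126 - 116 = 10.
\end{align*}
```

The abstract does not say whether the remaining 10 settings were unchanged or degraded.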
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Life-Harness, a lifecycle-aware runtime harness for frozen LLM agents in deterministic environments. It evolves reusable interventions from training trajectories by converting recurring interaction failures across environment contracts, procedural skills, action realization, and trajectory regulation; the harness remains fixed at test time. The central empirical claim is that this yields improvements in 116 of 126 model-environment settings across 18 backbones (average 88.5% relative gain) and that harnesses derived solely from Qwen3-4B-Instruct trajectories transfer to 17 other models, demonstrating capture of environment-side structure rather than model-specific patterns.
Significance. If the results and transfer evidence hold after clarification of the intervention-construction pipeline, the work provides a concrete, reproducible alternative to model-centric adaptation for rule-governed agent domains. The cross-model transfer result and public code release are notable strengths that would support broader adoption of interface-level fixes.
Major comments (2)
- [§3.2] Failure-to-Intervention Pipeline: The description of how recurring failures are detected and turned into fixed interventions does not explicitly state whether clustering or rule writing inspects model-generated token sequences or reasoning traces from the source trajectories. This detail is load-bearing for the transfer claim in the abstract and §5.3; without it, the 88.5% average improvement and cross-model results could partly reflect implicit model-specific patching rather than purely environment-side adaptation.
- [§4.2, Table 1] The 116/126 success count and per-setting relative improvements are reported without an ablation that isolates post-hoc selection or threshold tuning during harness construction. If any fitted rules or selection steps were applied after observing training trajectories, they must be shown to be environment-contract-only; otherwise the held-out evaluation gains risk over-attribution to the harness.
Minor comments (2)
- Notation for environment contracts and intervention types is introduced without a compact summary table; a single table listing the four categories with one canonical example each would improve readability.
- The abstract states 'Code is available at GitHub' but the manuscript does not include the exact repository URL or commit hash; this should be added for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on Life-Harness. The comments identify opportunities to strengthen the description of the intervention pipeline and to provide additional controls on the empirical results. We address each major comment below and indicate the revisions we will make.
- Referee: [§3.2] Failure-to-Intervention Pipeline: The description of how recurring failures are detected and turned into fixed interventions does not explicitly state whether clustering or rule writing inspects model-generated token sequences or reasoning traces from the source trajectories. This detail is load-bearing for the transfer claim in the abstract and §5.3; without it, the 88.5% average improvement and cross-model results could partly reflect implicit model-specific patching rather than purely environment-side adaptation.
  Authors: The failure-to-intervention pipeline in §3.2 operates exclusively on observable interaction traces consisting of environment observations, agent action strings, and resulting feedback signals as defined by the deterministic environment contracts. Clustering and rule formulation are performed on these environment-governed signals; no model-internal reasoning traces or full token sequences are inspected or used. This design ensures the resulting interventions address environment-contract, procedural, action-realization, and trajectory-regulation mismatches rather than model-specific patterns, which is consistent with the cross-model transfer results reported in §5.3. We will add an explicit statement of this scope to §3.2 in the revision.
  Revision: yes
- Referee: [§4.2, Table 1] The 116/126 success count and per-setting relative improvements are reported without an ablation that isolates post-hoc selection or threshold tuning during harness construction. If any fitted rules or selection steps were applied after observing training trajectories, they must be shown to be environment-contract-only; otherwise the held-out evaluation gains risk over-attribution to the harness.
  Authors: Harness construction applies fixed, deterministic criteria based on the recurrence of failure categories across training trajectories and their alignment with the predefined environment contracts; no performance-based threshold tuning or post-hoc selection of rules occurs after observing the trajectories. All interventions are therefore environment-contract-only by construction. To further isolate this aspect, we will add an ablation in the revised §4.2 that removes the recurrence filter and reports the resulting performance on the same 126 settings, confirming that the reported gains derive from the contract-derived interventions rather than selection artifacts.
  Revision: yes
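The scope the simulated rebuttal describes (environment-side signals only, with a recurrence criterion) can be illustrated with a short sketch. Everything here is hypothetical: the `Step` fields, the toy signature rule keyed on an `error:` prefix, and the `min_trajectories` threshold are assumptions for illustration, not details from the paper or its code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    observation: str     # what the environment showed
    action: str          # the action string the agent submitted
    feedback: str        # the environment's response to that action
    reasoning: str = ""  # model-internal text; deliberately never read below


def failure_signature(step: Step) -> Optional[str]:
    """Toy rule: key a failure by the first token after 'error:' in the feedback."""
    if step.feedback.startswith("error:"):
        return step.feedback.split(":", 1)[1].split()[0]
    return None


def recurring_failures(
    trajectories: List[List[Step]],
    min_trajectories: int = 2,          # hypothetical recurrence threshold
    use_recurrence_filter: bool = True,  # False mimics an ablation without the filter
) -> List[str]:
    counts: Counter = Counter()
    for traj in trajectories:
        seen = set()
        for step in traj:
            sig = failure_signature(step)
            if sig is not None:
                seen.add(sig)
        counts.update(seen)  # count each signature once per trajectory
    threshold = min_trajectories if use_recurrence_filter else 1
    return sorted(sig for sig, n in counts.items() if n >= threshold)


date_err = "error: invalid_date_format expected YYYY-MM-DD"
t1 = [Step("booking form", "book 5/1", date_err), Step("booking form", "book 2025-05-01", "ok")]
t2 = [Step("booking form", "book May 1", date_err, reasoning="model-specific musings")]
t3 = [Step("order page", "cancel", "error: missing_confirmation ask user first")]

kept = recurring_failures([t1, t2, t3])
unfiltered = recurring_failures([t1, t2, t3], use_recurrence_filter=False)
```

Because `failure_signature` reads only `feedback`, two trajectories with different reasoning text but the same environment error collapse to one signature. Switching off `use_recurrence_filter` also admits the one-off failure, which is roughly the comparison the promised ablation would report.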
Circularity Check
No significant circularity: harness construction and transfer results remain empirically grounded
The paper constructs Life-Harness by converting recurring failures observed in training trajectories into a fixed set of reusable interventions, then evaluates the frozen harness on held-out trajectories and across 17 other models. No equations, fitted parameters, or self-citations are shown that reduce the reported 88.5% average improvement or cross-model transfer to the training inputs by construction. The central claim rests on empirical measurement of environment-side structure captured in the interventions, with the transfer evidence serving as an external check rather than a self-referential loop. The method is therefore self-contained against the stated evaluation protocol.