Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

Huifeng Wen; Meng Li; Tianshi Xu

arxiv: 2605.22166 · v1 · pith:KWKXW7FGnew · submitted 2026-05-21 · 💻 cs.AI

Pith reviewed 2026-05-22 06:06 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · runtime harness · interface adaptation · deterministic environments · frozen models · trajectory interventions · agent benchmarks · transfer across models

The pith

A fixed runtime harness evolved from training failures improves frozen LLM agents across 18 models by adapting the interface rather than model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that many LLM agent failures in deterministic environments stem from mismatches at the model-environment interface rather than shortcomings in the model's knowledge or parameters. Life-Harness converts recurring failures observed in training trajectories into a fixed collection of reusable interventions that handle environment contracts, procedural skills, action realization, and trajectory regulation. These interventions stay unchanged during evaluation on held-out tasks. The approach delivers gains on 116 of 126 model-environment combinations across 18 backbones with an average 88.5 percent relative improvement, and a harness derived only from one small model transfers effectively to the rest. This positions runtime interface adaptation as a lightweight complement to model-centric training methods.

Core claim

Life-Harness evolves a lifecycle-aware runtime harness from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation. The harness remains fixed during held-out evaluation. When applied to frozen LLMs it improves 116 out of 126 model-environment settings across 18 backbones with an average relative improvement of 88.5 percent. Harnesses evolved solely from Qwen3-4B-Instruct trajectories transfer to 17 other models, indicating that the interventions capture reusable environment-side structure rather than model-specific behavior.

What carries the argument

Life-Harness, the fixed set of interventions derived from recurring failures in training trajectories that adjust observation, tool use, action execution, feedback interpretation, and trajectory control at runtime.
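To make the idea concrete, here is a minimal sketch of a fixed harness wrapped around a frozen model, assuming one simple rule per intervention category (environment contracts, procedural skills, action realization, trajectory regulation). Every name in it (`Harness`, `run_episode`, `normalize`, the toy model and environment) is hypothetical and chosen for illustration; it is not the paper's implementation or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Model = Callable[[str], str]                 # frozen model: prompt -> raw action text
EnvStep = Callable[[str], Tuple[str, bool]]  # unchanged environment: action -> (feedback, done)
History = List[Tuple[str, str]]              # executed (action, feedback) pairs


def normalize(raw: str) -> str:
    """Example action-realization rule: trim whitespace and lowercase."""
    return raw.strip().lower()


@dataclass
class Harness:
    contract_notes: List[str] = field(default_factory=list)  # environment contracts
    skills: List[str] = field(default_factory=list)          # procedural skills
    realize: Callable[[str], str] = normalize                # action realization
    max_repeats: int = 2                                     # trajectory regulation

    def build_prompt(self, task: str, history: History) -> str:
        # Before interaction and task conditioning: state rules and skills up front.
        lines = ["# Environment rules", *self.contract_notes]
        lines += ["# Skills", *self.skills, "# Task", task]
        for action, feedback in history:
            lines += [f"action: {action}", f"feedback: {feedback}"]
        return "\n".join(lines)

    def regulation_note(self, history: History) -> str:
        # After execution: flag a loop when the last few actions are identical.
        recent = [action for action, _ in history[-self.max_repeats:]]
        if len(recent) == self.max_repeats and len(set(recent)) == 1:
            return "You repeated the same action; try a different one."
        return ""


def run_episode(model: Model, env_step: EnvStep, harness: Harness,
                task: str, max_steps: int = 10) -> History:
    history: History = []
    for _ in range(max_steps):
        prompt = harness.build_prompt(task, history)
        note = harness.regulation_note(history)
        if note:
            prompt += "\n# Note\n" + note
        action = harness.realize(model(prompt))  # before environment execution
        feedback, done = env_step(action)        # the environment itself is untouched
        history.append((action, feedback))
        if done:
            break
    return history


# Toy stand-ins (made up): a model with sloppy formatting and a strict environment.
def toy_model(prompt: str) -> str:
    return "  OPEN door \n"


def toy_env(action: str) -> Tuple[str, bool]:
    ok = action == "open door"
    return ("door opened" if ok else "error: unknown command", ok)


harness = Harness(contract_notes=["Commands are lowercase."])
trajectory = run_episode(toy_model, toy_env, harness, "Open the door")
```

In this toy case the model's output never changes; only the action-realization rule makes it acceptable to the environment, which is the kind of interface-side fix the summary describes. Because the harness holds no model-specific state, the same object can be passed to `run_episode` with a different `model` callable.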

If this is right

Runtime interface adaptation can serve as a complement to model parameter updates for improving agents in rule-governed domains.
A single harness evolved from trajectories of one model can transfer to many other models without additional training.
Focusing on recurring failures in training data yields reusable fixes that improve held-out performance across multiple benchmarks and backbones.
Interface mismatches in deterministic environments can be addressed without changing model weights or evaluation setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the harness truly captures general environment structure, similar fixed interventions could be tested on additional deterministic tasks outside the original seven environments.
Developers might reduce per-model agent fine-tuning efforts if reusable harnesses handle common interface failures across deployments.
Some performance gaps attributed to model limitations in agent settings may instead reflect fixable interface design choices that can be handled separately from the model.

Load-bearing premise

Recurring interaction failures observed in training trajectories can be converted into a fixed set of reusable interventions that remain effective and non-overfitting on held-out evaluation trajectories without any further adaptation or selection during testing.

What would settle it

Finding that the Life-Harness either reduces performance or shows no improvement on a new collection of held-out trajectories from the same environments, or that a harness evolved from one model provides no benefit when applied to models outside its original set.

Figures

Figures reproduced from arXiv: 2605.22166 by Huifeng Wen, Meng Li, Tianshi Xu.

**Figure 2.** (a) An agent is not just an LLM: its behavior is shaped by the runtime harness that mediates observations, …

**Figure 3.** Failure diagnosis on training tasks. … harness adapts the model-environment interface rather than model weights. It operates on the interaction loop defined in Section 3.1: the environment contract $C$, the task description $x$, the environment state $s_t$, the model action $a_t$, and the trajectory $\tau_t$. 4.1 Failure Diagnosis: Before designing the harness, we first diagnose the primary failure modes of baseline agen…

**Figure 4.** Overview of Life-Harness. The harness adapts the model-environment interface through four lifecycle layers spanning before interaction, task conditioning, before environment execution, and after execution. … 4.3.2 Procedural Skill Layer: This layer provides non-parametric guidance from training trajectories. A skill is a compact and reusable strategy that captures the essence of how to accomplish specific sub…

**Figure 5.** Absolute performance improvement across 18 …

**Figure 6.** Training set performance improves steadily as …

**Figure 7.** Comparison with prompt evolving method.

**Figure 8.** Comparison between specialized tool-use training and runtime harnessing. Harnessing can outperform …

Original abstract

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed during held-out evaluation. On seven deterministic environments from $\tau$-bench, $\tau^2$-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real contribution is showing that a harness built once from failure patterns in a single small model can deliver large gains across many models and deterministic environments without any retraining.

The punchline is straightforward: instead of updating model weights, this work evolves a fixed runtime harness from recurring failures seen in training trajectories and keeps it unchanged at test time. On seven deterministic environments it lifts 116 of 126 model-environment combinations across 18 backbones, with an average relative gain of 88.5 percent. Harnesses derived only from Qwen3-4B-Instruct trajectories also transfer to the other 17 models, which is the part worth paying attention to if the claim holds up on inspection.
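The headline counts can be checked with simple arithmetic: 18 backbones times 7 environments gives the 126 settings, and 116 of them is about 92 percent. The snippet below does that check and also illustrates a relative-improvement calculation. The formula `(harnessed - baseline) / baseline` is the conventional definition and is an assumption here, since this summary does not state how the paper aggregates its 88.5 percent figure; the two score pairs are made-up numbers, not results from the paper.

```python
def relative_improvement(baseline: float, harnessed: float) -> float:
    """Assumed definition: (harnessed - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (harnessed - baseline) / baseline


# Counts taken from the summary above.
backbones, environments = 18, 7
settings = backbones * environments      # 126 model-environment settings
improved = 116
share_improved = improved / settings
print(settings, round(100 * share_improved, 1))  # prints: 126 92.1

# Made-up scores: a weak baseline and a strong baseline.
toy = [(0.10, 0.30), (0.60, 0.66)]
mean_rel = sum(relative_improvement(b, h) for b, h in toy) / len(toy)
```

With these toy numbers the weak-baseline setting contributes roughly +200 percent and the strong-baseline setting roughly +10 percent, so the mean lands near +105 percent. An average of relative gains can therefore be driven by low-baseline settings, which is one reason to read it alongside the per-setting count.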

Referee Report

2 major / 2 minor

Summary. The paper proposes Life-Harness, a lifecycle-aware runtime harness for frozen LLM agents in deterministic environments. It evolves reusable interventions from training trajectories by converting recurring interaction failures across environment contracts, procedural skills, action realization, and trajectory regulation; the harness remains fixed at test time. The central empirical claim is that this yields improvements in 116 of 126 model-environment settings across 18 backbones (average 88.5% relative gain) and that harnesses derived solely from Qwen3-4B-Instruct trajectories transfer to 17 other models, demonstrating capture of environment-side structure rather than model-specific patterns.

Significance. If the results and transfer evidence hold after clarification of the intervention-construction pipeline, the work provides a concrete, reproducible alternative to model-centric adaptation for rule-governed agent domains. The cross-model transfer result and public code release are notable strengths that would support broader adoption of interface-level fixes.

major comments (2)

§3.2 (Failure-to-Intervention Pipeline): The description of how recurring failures are detected and turned into fixed interventions does not explicitly state whether clustering or rule writing inspects model-generated token sequences or reasoning traces from the source trajectories. This detail is load-bearing for the transfer claim in the abstract and §5.3; without it, the 88.5% average improvement and cross-model results could partly reflect implicit model-specific patching rather than purely environment-side adaptation.
§4.2 and Table 1: The 116/126 success count and per-setting relative improvements are reported without an ablation that isolates post-hoc selection or threshold tuning during harness construction. If any fitted rules or selection steps were applied after observing training trajectories, they must be shown to be environment-contract-only; otherwise the held-out evaluation gains risk over-attribution to the harness.

minor comments (2)

Notation for environment contracts and intervention types is introduced without a compact summary table; a single table listing the four categories with one canonical example each would improve readability.
The abstract states 'Code is available at GitHub' but the manuscript does not include the exact repository URL or commit hash; this should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on Life-Harness. The comments identify opportunities to strengthen the description of the intervention pipeline and to provide additional controls on the empirical results. We address each major comment below and indicate the revisions we will make.

Referee: §3.2 (Failure-to-Intervention Pipeline): The description of how recurring failures are detected and turned into fixed interventions does not explicitly state whether clustering or rule writing inspects model-generated token sequences or reasoning traces from the source trajectories. This detail is load-bearing for the transfer claim in the abstract and §5.3; without it, the 88.5% average improvement and cross-model results could partly reflect implicit model-specific patching rather than purely environment-side adaptation.

Authors: The failure-to-intervention pipeline in §3.2 operates exclusively on observable interaction traces consisting of environment observations, agent action strings, and resulting feedback signals as defined by the deterministic environment contracts. Clustering and rule formulation are performed on these environment-governed signals; no model-internal reasoning traces or full token sequences are inspected or used. This design ensures the resulting interventions address environment-contract, procedural, action-realization, and trajectory-regulation mismatches rather than model-specific patterns, which is consistent with the cross-model transfer results reported in §5.3. We will add an explicit statement of this scope to §3.2 in the revision.
Revision: yes

Referee: §4.2 and Table 1: The 116/126 success count and per-setting relative improvements are reported without an ablation that isolates post-hoc selection or threshold tuning during harness construction. If any fitted rules or selection steps were applied after observing training trajectories, they must be shown to be environment-contract-only; otherwise the held-out evaluation gains risk over-attribution to the harness.

Authors: Harness construction applies fixed, deterministic criteria based on the recurrence of failure categories across training trajectories and their alignment with the predefined environment contracts; no performance-based threshold tuning or post-hoc selection of rules occurs after observing the trajectories. All interventions are therefore environment-contract-only by construction. To further isolate this aspect, we will add an ablation in the revised §4.2 that removes the recurrence filter and reports the resulting performance on the same 126 settings, confirming that the reported gains derive from the contract-derived interventions rather than selection artifacts.
Revision: yes
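The simulated responses describe a pipeline that looks only at observable traces (observations, action strings, feedback) and keeps failure categories that recur across training trajectories. A minimal sketch of such a recurrence filter follows, under loud assumptions: the `Step` record, the regex-based `failure_signature`, the `min_trajectories` threshold, and the toy traces are all hypothetical, and the paper's actual clustering procedure is not described in this summary.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Set


class Step(NamedTuple):
    observation: str
    action: str
    feedback: str


def failure_signature(step: Step) -> Optional[str]:
    """Derive a category from environment feedback only (hypothetical rule).

    Quoted values and numbers are masked so that the same kind of error
    maps to the same signature regardless of the specific arguments.
    """
    text = step.feedback.lower()
    if not text.startswith("error"):
        return None
    return re.sub(r"'[^']*'|\d+", "<x>", text)


def recurring_failures(trajectories: Dict[str, List[Step]],
                       min_trajectories: int = 2) -> Dict[str, int]:
    """Count distinct trajectories per signature and keep the recurring ones."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for traj_id, steps in trajectories.items():
        for step in steps:
            signature = failure_signature(step)
            if signature is not None:
                seen[signature].add(traj_id)
    return {sig: len(ids) for sig, ids in seen.items()
            if len(ids) >= min_trajectories}


# Made-up traces for illustration.
trajectories = {
    "t1": [Step("cart page", "pay(card='1234')", "Error: field 'card' must be 16 digits"),
           Step("cart page", "pay(card='1234123412341234')", "ok")],
    "t2": [Step("cart page", "pay(card='99')", "Error: field 'card' must be 16 digits")],
    "t3": [Step("search page", "find('shoes')", "Error: rate limit, retry in 3 s")],
}
recurring = recurring_failures(trajectories)
```

Here only the card-length error survives the filter, because it appears in two distinct trajectories. Setting `min_trajectories=1` disables the filter, which is roughly the shape of the ablation the responses promise; note also that the signature never reads the action string or any model reasoning, mirroring the environment-side scope the responses claim.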

Circularity Check

0 steps flagged

No significant circularity: harness construction and transfer results remain empirically grounded

full rationale

The paper constructs Life-Harness by converting recurring failures observed in training trajectories into a fixed set of reusable interventions, then evaluates the frozen harness on held-out trajectories and across 17 other models. No equations, fitted parameters, or self-citations are shown that reduce the reported 88.5% average improvement or cross-model transfer to the training inputs by construction. The central claim rests on empirical measurement of environment-side structure captured in the interventions, with the transfer evidence serving as an external check rather than a self-referential loop. The method is therefore self-contained against the stated evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that failure patterns in training trajectories encode reusable environment-side structure that generalizes to held-out settings; no explicit free parameters, axioms, or invented entities are stated in the abstract.


[1] [1]

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

Claw-eval: Toward trustworthy evaluation of autonomous agents , author=. arXiv preprint arXiv:2604.06132 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

MiMo-V2-Flash Technical Report

Mimo-v2-flash technical report , author=. arXiv preprint arXiv:2601.02780 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Thinking Machines Lab: Connectionism , year =

Kevin Lu and Thinking Machines Lab , title =. Thinking Machines Lab: Connectionism , year =

work page

[4] [4]

arXiv preprint arXiv:2601.15141 , year=

CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning , author=. arXiv preprint arXiv:2601.15141 , year=

work page arXiv

[5] [5]

arXiv preprint arXiv:2405.18369 , year=

Promptwizard: Task-aware prompt optimization framework , author=. arXiv preprint arXiv:2405.18369 , year=

work page arXiv

[6] [6]

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

Promptbreeder: Self-referential self-improvement via prompt evolution , author=. arXiv preprint arXiv:2309.16797 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses , author=. arXiv preprint arXiv:2604.25850 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

2026 , note =

HMMT Problem Sets , howpublished =. 2026 , note =

work page 2026

[9] [9]

Forty-first International Conference on Machine Learning , year=

Executable code actions elicit better llm agents , author=. Forty-first International Conference on Machine Learning , year=

work page

[10] [10]

Advances in Neural Information Processing Systems , volume=

Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=

work page

[11] [11]

International Conference on Learning Representations , volume=

Webarena: A realistic web environment for building autonomous agents , author=. International Conference on Learning Representations , volume=

work page

[12] [12]

International Conference on Learning Representations , volume=

Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows , author=. International Conference on Learning Representations , volume=

work page

[13] [13]

Advances in Neural Information Processing Systems , volume=

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems , volume=

work page

[14] [14]

2026 , howpublished =

Claude Code , author =. 2026 , howpublished =

work page 2026

[15] [15]

2026 , howpublished =

Codex CLI , author =. 2026 , howpublished =

work page 2026

[16] [16]

2026 , howpublished =

OpenCode: The open source AI coding agent , author =. 2026 , howpublished =

work page 2026

[17] [17]

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence , author=

work page

[18] [18]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=

work page

[19] [19]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[20] [20]

Qwen2.5: A Party of Foundation Models , url =

Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =

work page

[21] [21]

Advances in Neural Information Processing Systems , volume=

Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay , author=. Advances in Neural Information Processing Systems , volume=

work page

[22] [22]

Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik , journal=

work page

[23] [23]

Barres, Victor and Dong, Honghua and Ray, Soham and Si, Xujie and Narasimhan, Karthik , journal=

work page

[24] [24]

International Conference on Learning Representations , volume=

Agentbench: Evaluating llms as agents , author=. International Conference on Learning Representations , volume=

work page

[25] [25]

OpenAI engineering note , year=

Harness engineering: leveraging codex in an agent-first world , author=. OpenAI engineering note , year=

work page

[26] [26]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

work page

[27] Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[28] MemGPT: Towards LLMs as operating systems. 2023.

[29] Evolution through large models. Handbook of Evolutionary Machine Learning, 2023.

[30] Automated design of agentic systems. International Conference on Learning Representations.

[31] AFlow: Automating agentic workflow generation. International Conference on Learning Representations.

[32] MemEvolve: Meta-Evolution of Agent Memory Systems. arXiv preprint arXiv:2512.18746.

[33] Learning to continually learn via meta-learning agentic memory designs. arXiv preprint arXiv:2602.07755.

[34] Automatic prompt optimization with “gradient descent” and beam search. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

[35] Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems.

[36] TextGrad: Automatic "Differentiation" via Text. arXiv preprint arXiv:2406.07496.

[37] Large language models as optimizers. International Conference on Learning Representations.

[38] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv preprint arXiv:2507.19457.

[39] AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.

[40] OpenEvolve: an open-source evolutionary coding agent, 2025. URL https://github.com/codelion/openevolve

[41] Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919.

[42] LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling. arXiv preprint arXiv:2605.08083.

[43] Workspace Optimization: How to Train Your Agent. arXiv preprint arXiv:2605.09650.

[44] Continual Harness: Online Adaptation for Self-Improving Foundation Agents. arXiv preprint arXiv:2605.09998.

[45] HARBOR: Automated Harness Optimization. arXiv preprint arXiv:2604.20938.

[46] Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052.

[47] Optimizing instructions and demonstrations for multi-stage language model programs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

[48] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv preprint arXiv:2010.03768.

[49] WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems.

[50] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[51] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[52] DAPO: An open-source LLM reinforcement learning system at scale. Advances in Neural Information Processing Systems.

[53] Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.

[54] Gemma 3 Technical Report. arXiv.

[55] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv preprint arXiv:2601.18734.

[56] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv preprint arXiv:2604.13016.

[57] A Survey of On-Policy Distillation for Large Language Models. arXiv preprint arXiv:2604.00626.

[58] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618.

[59] Self-supervised prompt optimization. arXiv preprint arXiv:2502.06855.