pith. machine review for the scientific record.

arxiv: 2605.10913 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.PL · cs.SE

Recognition: 1 theorem link · Lean Theorem

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3

classification 💻 cs.AI · cs.PL · cs.SE
keywords meta-agents · execution trace · runtime forking · counterfactual optimization · Tree-RL · agent supervision · functional programming model · Git-like trace

The pith

Shepherd formalizes meta-agent operations as functions on a Git-like execution trace that records every interaction for fast forking and replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Shepherd as a functional programming model that treats meta-agent operations on target agents as functions whose core steps are mechanized in Lean. It records every agent-environment interaction as a typed event inside a Git-like execution trace, so any past state can be forked and replayed without restarting from scratch. Forking the agent process and filesystem runs five times faster than Docker while reusing over 95 percent of cached prompts on replay. Three concrete uses demonstrate the model: a live supervisor raises pair-coding success from 28.8 to 54.7 percent, branching counterfactual search beats baselines by up to 11 points while cutting wall-clock time by up to 58 percent, and selective forking of rollouts lifts Tree-RL performance on TerminalBench-2 from 34.2 to 39.4 percent. These outcomes position the trace and forking mechanism as practical infrastructure for writing and running meta-agents.

Core claim

Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The substrate forks the agent process and its filesystem five times faster than Docker and reuses more than 95 percent of prompt cache on replay. When applied to runtime intervention, counterfactual meta-optimization, and Tree-RL training, the trace produces measurable gains in pass rates, benchmark scores, and training efficiency across the reported tasks.

What carries the argument

The typed execution trace that stores every agent-environment interaction as an event and supports forking of both the agent process and its filesystem.
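The mechanism is easiest to see in miniature. Below is a hypothetical Python sketch of a Git-like typed trace with forking and cached replay; the names (`Event`, `Trace`, `fork`, `replay`) are illustrative, not Shepherd's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Git-like execution trace, not Shepherd's real API.
# Each agent-environment interaction is a typed event; a fork shares the
# parent's event prefix, so replaying a branch reuses cached results for
# the shared prefix instead of recomputing it.

@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "prompt", "tool_call", "observation"
    payload: str

@dataclass
class Trace:
    events: list[Event] = field(default_factory=list)
    cache: dict[Event, str] = field(default_factory=dict)  # shared across forks

    def record(self, kind: str, payload: str) -> Event:
        ev = Event(kind, payload)
        self.events.append(ev)
        return ev

    def fork(self, at: int) -> "Trace":
        # Branch from any past state: copy the event prefix, share the cache.
        return Trace(events=self.events[:at], cache=self.cache)

    def replay(self, compute) -> tuple[list[str], int]:
        # Re-run the trace; events seen before hit the cache.
        outputs, hits = [], 0
        for ev in self.events:
            if ev in self.cache:
                outputs.append(self.cache[ev]); hits += 1
            else:
                out = compute(ev)
                self.cache[ev] = out
                outputs.append(out)
        return outputs, hits

# Usage: record three events, replay once to warm the cache, fork at event 2,
# and observe that the fork's replay hits the cache for the shared prefix.
t = Trace()
for i in range(3):
    t.record("prompt", f"step-{i}")
outputs, hits = t.replay(lambda ev: ev.payload.upper())
branch = t.fork(at=2)
branch.record("prompt", "alt-step-2")
_, branch_hits = branch.replay(lambda ev: ev.payload.upper())
print(hits, branch_hits)  # first replay: 0 hits; fork reuses 2 cached events
```

The shared `cache` dictionary is the toy analogue of prompt-cache reuse: the fork pays only for the new suffix, never for the prefix it inherited.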

If this is right

  • A live supervisor using the trace can raise pair-coding pass rates from 28.8 percent to 54.7 percent on CooperBench.
  • Branching exploration inside the trace outperforms baselines on four benchmarks by as much as 11 points and reduces wall-clock time by as much as 58 percent.
  • Forking rollouts at selected turns inside the trace raises TerminalBench-2 performance from 34.2 percent to 39.4 percent.
  • Any past agent state captured in the trace can be replayed or branched without restarting the full environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trace structure could let developers version and debug ordinary single-agent systems the way git versions code.
  • High cache reuse on replay suggests the mechanism may scale to longer-horizon agent runs where repeated prompt computation would otherwise dominate cost.
  • If the Lean mechanization of core operations is extended, it could support machine-checked proofs that certain meta-agent interventions preserve safety properties.
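As a toy illustration of what such a mechanization might look like, here is a minimal Lean sketch (not from the paper): a trace is a list of events, replay is a fold, and the machine-checked lemma states that replaying a prefix-plus-suffix trace agrees with replaying the suffix from the prefix's final state, which is the property that makes forking from a past state sound.

```lean
-- Hypothetical sketch, not Shepherd's Lean development.
inductive Event where
  | obs : String → Event
  | act : String → Event

def step : String → Event → String
  | st, .obs s => st ++ s
  | st, .act s => st ++ s

def replay (init : String) (t : List Event) : String :=
  t.foldl step init

-- Forking soundness in miniature: the state after replaying t₁ ++ t₂
-- equals replaying t₂ from the state a fork captured at the end of t₁.
theorem replay_append (init : String) (t₁ t₂ : List Event) :
    replay init (t₁ ++ t₂) = replay (replay init t₁) t₂ := by
  simp [replay, List.foldl_append]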

Load-bearing premise

The reported gains in intervention success, optimization scores, and RL performance arise from the trace and forking features rather than from unmeasured differences in experimental setup or implementation.

What would settle it

Run the same three applications with the forking and trace recording disabled while keeping every other component fixed; if the pass-rate, benchmark, and training improvements disappear, the central claim is supported.
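The arithmetic behind that ablation can also be sketched. With forking disabled, every counterfactual branch pays for a full re-run; with forking, it pays only for the suffix past the branch point. The Python toy below (the names `explore` and `branch_points` are invented for illustration) makes the wall-clock asymmetry concrete.

```python
# Hypothetical ablation sketch, not the paper's protocol: count environment
# steps needed to evaluate one counterfactual branch per branch point.

def explore(turns: int, branch_points: list[int], forking: bool) -> int:
    """Total environment steps for one rollout plus one branch per point."""
    steps = turns  # one full rollout either way
    for t in branch_points:
        if forking:
            steps += turns - t   # replay only the suffix after the fork point
        else:
            steps += turns       # no forking: rerun the whole trajectory
    return steps

with_fork = explore(turns=10, branch_points=[3, 6, 9], forking=True)
without = explore(turns=10, branch_points=[3, 6, 9], forking=False)
print(with_fork, without)  # 22 vs 40 environment steps
```

If the reported gains persist even in the `forking=False` condition, the trace is not doing the work the paper credits it with; if they vanish, the attribution holds.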

Figures

Figures reproduced from arXiv: 2605.10913 by Ananjan Nandi, Christopher D Manning, Derek Chong, Dilara Soylu, Jiuding Sun, Simon Yu, Weiyan Shi.

Figure 1
Figure 1. SHEPHERD meta-agents. Top: A supervisor meta-agent manages code repair agents. Bottom: Results from three meta-agents: (A) live supervision; (B) meta-optimization; (C) Tree GRPO. view at source ↗
Figure 2
Figure 2. Live intervention experiments on CooperBench, with Claude Haiku 4.5 as worker. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. LiveCodeBench comparison. Left: held-out test pass-rate versus optimization wallclock. Right: dev-set trajectory for each method across optimization wallclock. CRO subtask-cache reuse is reported separately in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. CRO computation reuse on LiveCodeBench rises from ∼1% on the first cold proposer session to over 60%. Setup. We evaluate on subsets of HoVer [13], MATH [8], IFBench [27], LiveCodeBench [11], and TerminalBench 2.0 (TB-2; [24]), comparing CRO against the baseline workflow, GEPA (optimizing workflow code) [2], and MetaHarness [19]. The executor is GPT-5.4-mini and meta-optimizers use GPT-5.4 (in the Codex h… view at source ↗
Figure 5
Figure 5. Trajectory compression across two worker model families and two benchmarks. The same [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. HoVer: held-out test pass-rate vs. optimization wallclock (left) and per-iteration dev-set [PITH_FULL_IMAGE:figures/full_fig_p039_6.png] view at source ↗
Figure 7
Figure 7. HoVer: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p040_7.png] view at source ↗
Figure 8
Figure 8. IFBench: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p040_8.png] view at source ↗
Figure 9
Figure 9. IFBench: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗
Figure 10
Figure 10. LiveCodeBench: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p041_10.png] view at source ↗
Figure 11
Figure 11. LiveCodeBench: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p041_11.png] view at source ↗
Figure 12
Figure 12. MATH (Level 5): test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p042_12.png] view at source ↗
Figure 13
Figure 13. MATH (Level 5): subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p042_13.png] view at source ↗
Figure 14
Figure 14. TerminalBench 2.0: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗
Figure 15
Figure 15. TerminalBench 2.0: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p043_15.png] view at source ↗
Figure 16
Figure 16. GRPO group composition over training (rows: base model; columns: setting). Tree-GRPO [PITH_FULL_IMAGE:figures/full_fig_p047_16.png] view at source ↗
Figure 17
Figure 17. Held-out Endless Terminals evaluation, sampled every 10 training steps (raw, unsmoothed). [PITH_FULL_IMAGE:figures/full_fig_p047_17.png] view at source ↗
Figure 18
Figure 18. Train raw reward (mean over G=8 roots) for both base models; panels are Qwen3.5-35B-A3B (left) and Nemotron-3-Super-120B-A12B (right). Tree-GRPO (K=4, teal) reaches higher reward than Flat GRPO (red) at every rollout step. Faint dots are observed steps from the flat-baseline run; smooth lines are denoised trajectories. view at source ↗
Figure 19
Figure 19. Early-mistake case. The wrong package name on turn 1 dooms the rest of the trajectory. [PITH_FULL_IMAGE:figures/full_fig_p048_19.png] view at source ↗
Figure 20
Figure 20. Ambiguous case. At least three turns offer plausible branches (skip-the-version-check, [PITH_FULL_IMAGE:figures/full_fig_p049_20.png] view at source ↗
Figure 21
Figure 21. Long-trajectory case. A 9-turn rollout with a wrong-file edit at turn 4 cascades into 5 [PITH_FULL_IMAGE:figures/full_fig_p049_21.png] view at source ↗
read the original abstract

We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem $5\times$ faster than Docker, achieving $>95\%$ prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. It records every agent-environment interaction as a typed event in a Git-like execution trace, enabling forking and replay of past states. The system forks agent processes and filesystems 5× faster than Docker with >95% prompt-cache reuse on replay. It demonstrates the model in three applications: runtime intervention raising pair-coding pass rates from 28.8% to 54.7% on CooperBench; counterfactual meta-optimization outperforming baselines by up to 11 points with up to 58% wall-clock reduction across four benchmarks; and Tree-RL training improving TerminalBench-2 from 34.2% to 39.4%. These results are presented as establishing Shepherd as efficient infrastructure for programming meta-agents, with the system open-sourced.

Significance. If the empirical claims hold under proper controls, Shepherd could provide a useful formalized runtime substrate for meta-agent development, leveraging execution traces for intervention, optimization, and training. The mechanization of core operations in Lean is a clear strength, supplying machine-checked proofs for the model. Open-sourcing the system is also a positive step that supports reproducibility and community follow-on work.

major comments (1)
  1. [Abstract] The abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with up to 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of Shepherd's contributions, including the Lean mechanization and open-sourcing, and for the constructive feedback on the abstract. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] The abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with up to 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'

    Authors: The abstract is intentionally concise to highlight key outcomes, following standard academic practice. The full manuscript provides the requested experimental details, including baselines, statistical tests, error bars, and methodology, in the dedicated evaluation sections for each application (runtime intervention, counterfactual meta-optimization, and Tree-RL training). These sections describe controlled experiments that isolate the contributions of the typed execution traces and forking mechanisms, supporting the attribution of the reported gains. The abstract's central claim is thus grounded in the body of the paper rather than standing alone. revision: no

Circularity Check

0 steps flagged

No circularity: empirical claims rest on reported results without derivations or self-referential reductions

full rationale

The provided abstract contains no equations, derivations, fitted parameters, or self-citations. It introduces Shepherd as a functional model with execution traces and forking, then reports three separate empirical applications (runtime intervention on CooperBench, counterfactual optimization on four benchmarks, Tree-RL on TerminalBench-2) with performance deltas. These results are presented as demonstrations rather than as outputs derived from the system's definition by construction. No load-bearing step reduces a prediction or uniqueness claim to an input fit or prior self-citation; the central infrastructure claim is supported by the listed experimental outcomes, which remain externally verifiable in principle.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to identify specific free parameters, axioms, or invented entities. The Lean mechanization likely relies on standard mathematical axioms for formal verification but none are explicitly listed.

pith-pipeline@v0.9.0 · 5479 in / 1149 out tokens · 80011 ms · 2026-05-12T03:21:05.580474+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 20 internal anchors

  1. [1]

    Managed agents (Anthropic Engineering)

    Anthropic. Managed agents. URL https://www.anthropic.com/engineering/managed-agents

  2. [2]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, July 2025. URL htt...

  3. [3]

    How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

    Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G. Dimakis, and Joseph E. Gonzalez. How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models, October 2025. URL http://arxiv.org/abs/2510.02453. arXiv:2510.02453 [cs]

  4. [4]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent llm systems fail?, 2025. URL https://arxiv.org/abs/2503.13657

  5. [5]

    Context-lite multi-turn reinforcement learning for LLM agents

    Wentse Chen, Jiayu Chen, Hao Zhu, and Jeff Schneider. Context-lite multi-turn reinforcement learning for LLM agents. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025. URL https://openreview.net/forum?id=6CE5PLsZdW

  6. [6]

    Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

    Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs, June 2024. URL https://arxiv.org/abs/2406.16218. arXiv:2406.16218

  7. [7]

    Endless Terminals: Scaling RL Environments for Terminal Agents

    Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless Terminals: Scaling RL Environments for Terminal Agents, January 2026. URL https://arxiv.org/abs/2601.16443. arXiv:2601.16443

  8. [8]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. URL https://arxiv.org/abs/2103.03874

  10. [10]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi- Agent Collaborative Framework, August 2023. URLhttps://arxiv.org/abs/2308.00352. arXiv:2308.00352

  11. [11]

    TreeRL: LLM reinforcement learning with on-policy tree search

    Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. TreeRL: LLM reinforcement learning with on-policy tree search, 2025. URL https://arxiv.org/abs/2506.11902

  12. [12]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974

  13. [13]

    Tree search for LLM agent reinforcement learning

    Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for LLM agent reinforcement learning, 2026. URL https://arxiv.org/abs/2509.21240

  14. [14]

    Hover: A dataset for many-hop fact extraction and claim verification, 2020

    Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. Hover: A dataset for many-hop fact extraction and claim verification, 2020. URL https://arxiv.org/abs/2011.03088

  15. [15]

    SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  16. [16]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, October 2023. URL https://arxiv.org/abs/2310.03714. arXiv:2310.03714

  17. [17]

    CooperBench: Why Coding Agents Cannot be Your Teammates Yet

    Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, and Diyi Yang. CooperBench: Why Coding Agents Cannot be Your Teammates Yet, January 2026. URL https://arxiv.org/abs/2601.13295. arXiv:2601.13295

  18. [18]

    Scaling Test-Time Compute for Agentic Coding

    Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling Test-Time Compute for Agentic Coding, 2026. URL https://arxiv.org/abs/2604.16529

  19. [19]

    Tinker

    Thinking Machines Lab. Tinker, 2025. URL https://thinkingmachines.ai/tinker/

  20. [20]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses, March 2026. URL https://arxiv.org/abs/2603.28052. arXiv:2603.28052

  21. [21]

    Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

    Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks, 2026. URL https://arxiv.org/abs/2604.11753

  22. [22]

    Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

    Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A. Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, and Joseph E. Gonzalez. Combee: Scaling Prompt Learning for Self-Improving Language Model Agents, April 2026. URL http://arxiv.org/abs/2604.04247. arXiv:2604.04247 [cs]

  23. [23]

    AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems

    Yang Li, Siqi Ping, Xiyu Chen, Xiaojian Qi, Zigan Wang, Ye Luo, and Xiaowei Zhang. AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems, November 2025. URL https://arxiv.org/abs/2511.00628. arXiv:2511.00628

  24. [24]

    Stop wasting your tokens: Towards efficient runtime multi-agent systems

    Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, and Tao Lin. Stop wasting your tokens: Towards efficient runtime multi-agent systems. arXiv preprint arXiv:2510.26585, 2025. URL https://arxiv.org/abs/2510.26585

  25. [25]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

  26. [26]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  27. [27]

    Nvidia Nemotron 3: Efficient and open intelligence

    NVIDIA. Nvidia Nemotron 3: Efficient and open intelligence, 2025. URL https://arxiv.org/abs/2512.20856. White Paper

  28. [28]

    Generalizing verifiable instruction following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. URL https://arxiv.org/abs/2507.02833

  30. [30]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  31. [31]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  32. [32]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning, March 2023. URL https://arxiv.org/abs/2303.11366. arXiv:2303.11366

  33. [33]

    Reinforcement learning: An introduction

    R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, 1st edition, 1998

  34. [34]

    Hindsight credit assignment for long-horizon LLM agents

    Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, and Yu-Feng Li. Hindsight credit assignment for long-horizon LLM agents, 2026. URL https://arxiv.org/abs/2603.08754

  35. [35]

    Counterfactual explanations without opening the black box: Automated decisions and the GDPR

    Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018

  36. [36]

    Fork, Explore, Commit: OS Primitives for Agentic Exploration

    Cong Wang and Yusheng Zheng. Fork, Explore, Commit: OS Primitives for Agentic Exploration, February 2026. URL https://arxiv.org/abs/2602.08199. arXiv:2602.08199

  37. [37]

    AgentSPEX: An Agent SPecification and EXecution Language

    Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, and Tong Zhang. AgentSPEX: An Agent SPecification and EXecution Language, 2026. URL https://arxiv.org/abs/2604.13346

  38. [38]

    A practitioner’s guide to multi-turn agentic reinforcement learning

    Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner’s guide to multi-turn agentic reinforcement learning, 2025. URL https://arxiv.org/abs/2510.01132

  39. [39]

    OpenHands V1: Event-sourced state management for multi-agent coding systems

    Xingyao Wang, Jiayi Pan, Binyuan Hui, et al. OpenHands V1: Event-sourced state management for multi-agent coding systems. MLSys, 2026. URL https://arxiv.org/abs/2511.03690. arXiv:2511.03690

  40. [40]

    ThetaEvolve: Test-time Learning on Open Problems

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. ThetaEvolve: Test-time Learning on Open Problems. URL https://arxiv.org/abs/2511.23473

  42. [42]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory, 2025. URL https://arxiv.org/abs/2511.20857

  43. [43]

    Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing

    Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026. URL https://arxiv.org/abs/2602.04837

  44. [44]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, August 2023. URL https://arxiv.org/abs/2308.08155. arXiv:2308.08155

  45. [45]

    Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

    Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, and Huaxiu Yao. Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning. URL https://arxiv.org/abs/2511.16043

  47. [47]

    Monte Carlo tree search boosts reasoning via iterative preference learning

    Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte Carlo tree search boosts reasoning via iterative preference learning. URL https://arxiv.org/abs/2405.00451

  49. [49]

    Recursive Language Models

    Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2025. URL https://arxiv.org/abs/2512.24601

  50. [50]

    Zhang, Zhening Li, and Omar Khattab

    Alex L. Zhang, Zhening Li, and Omar Khattab. The Mismanaged Geniuses Hypothesis, 2026. URLhttps://alexzhang13.github.io/blog/2026/mgh/. Blog post

  51. [51]

    Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

    Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?, 2025. URL https: //arxiv.org/abs/2509.03312

  52. [52]

    Darwin G

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents, 2025. URL https://arxiv.org/abs/ 2505.22954

  53. [53]

    Hyperagents.arXiv preprint arXiv:2603.19461, 2026

    Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents, March 2026. URL https://arxiv.org/abs/ 2603.19461. arXiv:2603.19461

  54. [54]

    Tomz, Christopher D

    Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity. InarXiv preprint arXiv:2510.01171, 2025. URL https://arxiv.org/abs/ 2510.01171

  55. [55]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2025. URLhttps://arxiv.org/abs/2510.04618

  56. [56]

    MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory,

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, and Muning Wen. MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory,

  57. [57]

    URL https://arxiv.org/abs/2601.03192

  58. [58]

    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models, October 2023. URL https://arxiv.org/abs/2310.04406. arXiv:2310.04406

  59. [59]

    Monte Carlo Tree Search: a review of recent modifications and applications

    Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte Carlo Tree Search: a review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2022. ISSN 1573-7462. doi: 10.1007/s10462-022-10228-y. URL http://dx.doi.org/10.1007/s10462-022-10228-y

  60. [60]

    This should be the default for the vast majority of agents on most ticks

    “none” — everything is fine, let the agent keep working. This should be the default for the vast majority of agents on most ticks. Over-intervention destroys progress

  61. [61]

    oh, the supervisor is nudging me

    “steer” — CHEAPEST intervention. The agent’s conversation is kept intact; we only append a new user message with your guidance so the agent sees it as “oh, the supervisor is nudging me”. Full conversation history and tool call context are preserved, KV cache is reused. Use this when: • the agent is broadly on task but drifting or about to make a minor wron...

  62. [62]

    redirect

    “redirect” — EXPENSIVE. The agent’s current session is aborted and a fresh opencode session starts with your guidance as message 1. The agent loses ALL memory of what it has explored, read, tried, or learned — it starts from scratch (but the files it already edited are still there on disk). Use this when: • the agent is stuck in an obvious loop (same tool,...

  63. [63]

    revert

    “revert” — EXPENSIVE and destructive. Same as redirect on the LLM side (new session, lost memory) PLUS the sandbox filesystem is rolled back to the pre-run checkpoint. All files the agent edited are discarded. Use this ONLY when: • the agent wrote files that corrupt the repo (overwrote core code with garbage, introduced unrelated changes, broke syntax) • t...
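
The four intervention types excerpted above form an escalation ladder, from free to destructive. A minimal sketch of that triage: only the four action names and their cost ordering come from the prompt text; the `choose_intervention` helper and its boolean flags are hypothetical illustration, not Shepherd's actual API.

```python
from enum import Enum


class Intervention(Enum):
    """Supervisor actions from the taxonomy above, cheapest first."""
    NONE = 0      # let the agent keep working (the default on most ticks)
    STEER = 1     # append a guidance message; history and KV cache preserved
    REDIRECT = 2  # fresh session, memory lost; edited files stay on disk
    REVERT = 3    # redirect plus filesystem rollback to the pre-run checkpoint


def choose_intervention(repo_corrupted: bool, stuck_in_loop: bool,
                        drifting: bool) -> Intervention:
    """Escalate only as far as the observed failure requires."""
    if repo_corrupted:
        return Intervention.REVERT   # files must be discarded, not just the session
    if stuck_in_loop:
        return Intervention.REDIRECT  # the conversation itself is the problem
    if drifting:
        return Intervention.STEER     # a nudge suffices; context stays intact
    return Intervention.NONE
```

The ordering encodes the prompt's warning that over-intervention destroys progress: a cheaper action is always preferred when it can plausibly fix the failure.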

  64. [64]

    (turn 1, session 0 only) Read README.md for the search policy, mechanism axes, and loop hard rules; read ORIENTATION.md for the run parameters

  65. [65]

    (turn 1, every session) Read brief.md for the session index, frontier candidate, remaining budget, and the exact pending-batch and journal-pending file names this session must produce

  66. [66]

    (working turns) Use the read-only inspection tools (§F.2.3) to inspect prior runs; pick a base reference (frontier, baseline, promoted, a run_NNN, a cand-XXX, or a Meta-Git scope ref); construct sibling variants with stage_variant, or by writing files directly under variants/session_NNN/vXX/workflow/; attach targeted_examples = {improve, protect, invariant} to each

  67. [67]

    When finish_session returns control to the host, the proposer’s LLM session terminates and no chat-history context survives across sessions

    (handoff turn) Write hypothesis_logs/session_NNN.md with ### Findings, ### Hypotheses, ### Considered & Rejected, and ### Selected Batch sections; drop a session-fragment into journal_pending/session_NNN.md; write the manifest to pending_batches/session_NNN.json; call finish_session. When finish_session returns control to the host, the proposer’s LLM session t...
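
The handoff turn above can be sketched as a small helper. It assumes only what the prompt text names (the three directories, the session_NNN naming, and the four markdown sections); `write_handoff` itself and its signature are illustrative, not part of the published tool surface.

```python
from pathlib import Path


def write_handoff(root: Path, session: int) -> None:
    """Create the three handoff artifacts a proposer session must leave behind
    before calling finish_session (hypothetical helper; paths follow the
    prompt text)."""
    n = f"session_{session:03d}"

    # Hypothesis log with the four required sections.
    logs = root / "hypothesis_logs"
    logs.mkdir(parents=True, exist_ok=True)
    (logs / f"{n}.md").write_text(
        "### Findings\n\n### Hypotheses\n\n"
        "### Considered & Rejected\n\n### Selected Batch\n"
    )

    # Session fragment for the journal.
    journal = root / "journal_pending"
    journal.mkdir(exist_ok=True)
    (journal / f"{n}.md").write_text("session fragment\n")

    # Batch manifest (empty placeholder here).
    batches = root / "pending_batches"
    batches.mkdir(exist_ok=True)
    (batches / f"{n}.json").write_text("{}\n")

    # finish_session() would be called here; control returns to the host and
    # no chat-history context survives into the next session.
```

Because no chat history crosses the session boundary, these files are the only channel through which one proposer session can inform the next.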

  68. [68]

    It is the host’s compact projection of candidates, failures, prior logs, and the handoff contract

    Read ‘brief.md‘ first. It is the host’s compact projection of candidates, failures, prior logs, and the handoff contract

  69. [69]

    Choose a set of prior sources as the batch base: ‘frontier‘, ‘baseline‘, ‘promoted‘, a run id, a candidate id, or a Meta-Git scope ref/name

  70. [70]

    Create sibling variants from that set of bases using ‘stage_variant‘, or by passing full ‘files‘/‘workflow_dir‘ entries in a batch manifest

  71. [71]

    These targeted checks are the preflight before hidden aggregate dev scoring

    Every variant must include explicit targeted examples with improve, protect, and invariant intent. These targeted checks are the preflight before hidden aggregate dev scoring

  72. [72]

    Call ‘run_counterfactual_batch‘ or ‘submit_counterfactual_batch‘

  73. [73]

    Inspect aggregate outcomes, write the required ‘experiment_logs/eXXX.md‘ with ‘## Outcome‘ and ‘## Next‘, then call ‘finish_session‘. Rules: fix failure classes, not literal train examples; preserve the Agentic task shape; use valid train ids from brief.md, candidate_catalog.json, or traces/metrics; avoid near-duplicate prompt tweaks; trea...
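
A plausible shape for the pending-batch manifest implied by these rules, with field names inferred from the excerpts above (base refs, variant workflow directories, and the improve/protect/invariant intents); the paper's actual JSON schema is not reproduced in this excerpt, so treat every key below as an assumption.

```python
import json

# Hypothetical manifest for pending_batches/session_NNN.json; field names are
# inferred from the prompt excerpts, not taken from a published schema.
manifest = {
    "base": "frontier",  # or "baseline", "promoted", a run id, a candidate id,
                         # or a Meta-Git scope ref
    "variants": [
        {
            "id": "v01",
            "workflow_dir": "variants/session_012/v01/workflow/",
            "targeted_examples": {
                "improve": ["train_041", "train_087"],  # the failure class this change should fix
                "protect": ["train_003"],               # passes that must not regress
                "invariant": ["train_019"],             # behavior that must stay identical
            },
        }
    ],
}

print(json.dumps(manifest, indent=2))
```

The three-intent split mirrors the rule that targeted checks are a preflight: a variant that fixes its improve examples but breaks a protect or invariant example fails before the hidden aggregate dev scoring is ever spent on it.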

  74. [74]

    show_history; read one trace.md of an interesting prior run; identify a failure CLASS (>=2 examples)

    ANALYZE. show_history; read one trace.md of an interesting prior run; identify a failure CLASS (>=2 examples)

  75. [75]

    Required before every batch

    HYPOTHESIZE. Required before every batch. Write hypotheses/hNNN_*.md with Branch from / Claim / Proposed change / Expected outcome / Why this differs from previous attempts / Cache consequence sections

  76. [76]

    branch(from_ref=...) to reset workflow/

    BRANCH. branch(from_ref=...) to reset workflow/

  77. [77]

    EDIT + check_workflow

  78. [78]

    Provide explicit targeted_examples for every variant

    Prefer run_counterfactual_batch with several staged siblings. Provide explicit targeted_examples for every variant

  79. [79]

    Required between runs

    OBSERVE. Required between runs. Write observations/oNNN.md

  80. [80]

    Flat dev is not a stop condition; switch to a structurally different move

    Repeat until budget exhausted. Flat dev is not a stop condition; switch to a structurally different move. F.2.3 Proposer tool surface The proposer’s tools fall into four groups: filesystem inspection and editing, ledger inspection, effect-level introspection, and host-mediated evaluation. Table 13 lists the live surface; full JSON-Schema specifications...

Showing first 80 references.
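
The ANALYZE / HYPOTHESIZE / BRANCH / EDIT / OBSERVE loop excerpted in the last entries can be sketched as a host-side skeleton. The tool names follow the prompt text (show_history, branch, check_workflow, run_counterfactual_batch, finish_session); `run_proposer` and the `tools` dict of callables are illustrative scaffolding, not the paper's implementation.

```python
def run_proposer(tools: dict, budget: int) -> list:
    """One ANALYZE -> HYPOTHESIZE -> BRANCH -> EDIT -> RUN -> OBSERVE pass per
    batch, repeated until the evaluation budget is exhausted (sketch only)."""
    log = []
    while budget > 0:
        log.append(tools["analyze"]())   # show_history; find a failure CLASS (>= 2 examples)
        tools["hypothesize"]()           # hypotheses/hNNN_*.md, required before every batch
        tools["branch"]("frontier")      # reset workflow/ from the chosen base ref
        tools["edit_and_check"]()        # EDIT + check_workflow
        budget -= tools["run_batch"]()   # staged siblings, each with targeted_examples
        tools["observe"]()               # observations/oNNN.md, required between runs
    tools["finish_session"]()            # flat dev is not a stop condition; only budget is
    return log
```

The loop deliberately has no early exit on a flat dev score: per the prompt's rule, a plateau triggers a structurally different move on the next iteration, and only budget exhaustion ends the session.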