pith. machine review for the scientific record.

arxiv: 2605.10913 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.PL · cs.SE

Recognition: 1 theorem link · Lean Theorem

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3

classification 💻 cs.AI · cs.PL · cs.SE
keywords meta-agents · execution trace · runtime forking · counterfactual optimization · Tree-RL · agent supervision · functional programming model · Git-like trace

The pith

Shepherd formalizes meta-agent operations as functions on a Git-like execution trace that records every interaction for fast forking and replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Shepherd as a functional programming model that treats meta-agent operations on target agents as functions whose core steps are mechanized in Lean. It records every agent-environment interaction as a typed event inside a Git-like execution trace, so any past state can be forked and replayed without restarting from scratch. Forking the agent process and filesystem runs five times faster than Docker while reusing over 95 percent of cached prompts on replay. Three concrete uses demonstrate the model: a live supervisor raises pair-coding success from 28.8 to 54.7 percent, branching counterfactual search beats baselines by up to 11 points while cutting wall-clock time by up to 58 percent, and selective forking of rollouts lifts Tree-RL performance on TerminalBench-2 from 34.2 to 39.4 percent. These outcomes position the trace and forking mechanism as practical infrastructure for writing and running meta-agents.

Core claim

Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The substrate forks the agent process and its filesystem five times faster than Docker and reuses more than 95 percent of prompt cache on replay. When applied to runtime intervention, counterfactual meta-optimization, and Tree-RL training, the trace produces measurable gains in pass rates, benchmark scores, and training efficiency across the reported tasks.

What carries the argument

The typed execution trace that stores every agent-environment interaction as an event and supports forking of both the agent process and its filesystem.
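The mechanism is easiest to see in miniature. Below is a hypothetical Python sketch of a Git-like typed trace with forking and cached replay; the names (`Event`, `Trace`, `fork`, `replay`) are illustrative, not Shepherd's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Git-like execution trace, not Shepherd's real API.
# Each agent-environment interaction is a typed event; a fork shares the
# parent's event prefix, so replaying a branch reuses cached results for
# the shared prefix instead of recomputing it.

@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "prompt", "tool_call", "observation"
    payload: str

@dataclass
class Trace:
    events: list[Event] = field(default_factory=list)
    cache: dict[Event, str] = field(default_factory=dict)  # shared across forks

    def record(self, kind: str, payload: str) -> Event:
        ev = Event(kind, payload)
        self.events.append(ev)
        return ev

    def fork(self, at: int) -> "Trace":
        # Branch from any past state: copy the event prefix, share the cache.
        return Trace(events=self.events[:at], cache=self.cache)

    def replay(self, compute) -> tuple[list[str], int]:
        # Re-run the trace; events seen before hit the cache.
        outputs, hits = [], 0
        for ev in self.events:
            if ev in self.cache:
                outputs.append(self.cache[ev]); hits += 1
            else:
                out = compute(ev)
                self.cache[ev] = out
                outputs.append(out)
        return outputs, hits

# Usage: record three events, replay once to warm the cache, fork at event 2,
# and observe that the fork's replay hits the cache for the shared prefix.
t = Trace()
for i in range(3):
    t.record("prompt", f"step-{i}")
outputs, hits = t.replay(lambda ev: ev.payload.upper())
branch = t.fork(at=2)
branch.record("prompt", "alt-step-2")
_, branch_hits = branch.replay(lambda ev: ev.payload.upper())
print(hits, branch_hits)  # first replay: 0 hits; fork reuses 2 cached events
```

The shared `cache` dictionary is the toy analogue of prompt-cache reuse: the fork pays only for the new suffix, never for the prefix it inherited.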

If this is right

  • A live supervisor using the trace can raise pair-coding pass rates from 28.8 percent to 54.7 percent on CooperBench.
  • Branching exploration inside the trace outperforms baselines on four benchmarks by as much as 11 points and reduces wall-clock time by as much as 58 percent.
  • Forking rollouts at selected turns inside the trace raises TerminalBench-2 performance from 34.2 percent to 39.4 percent.
  • Any past agent state captured in the trace can be replayed or branched without restarting the full environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trace structure could let developers version and debug ordinary single-agent systems the way git versions code.
  • High cache reuse on replay suggests the mechanism may scale to longer-horizon agent runs where repeated prompt computation would otherwise dominate cost.
  • If the Lean mechanization of core operations is extended, it could support machine-checked proofs that certain meta-agent interventions preserve safety properties.
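As a toy illustration of what such a mechanization might look like, here is a minimal Lean sketch (not from the paper): a trace is a list of events, replay is a fold, and the machine-checked lemma states that replaying a prefix-plus-suffix trace agrees with replaying the suffix from the prefix's final state, which is the property that makes forking from a past state sound.

```lean
-- Hypothetical sketch, not Shepherd's Lean development.
inductive Event where
  | obs : String → Event
  | act : String → Event

def step : String → Event → String
  | st, .obs s => st ++ s
  | st, .act s => st ++ s

def replay (init : String) (t : List Event) : String :=
  t.foldl step init

-- Forking soundness in miniature: the state after replaying t₁ ++ t₂
-- equals replaying t₂ from the state a fork captured at the end of t₁.
theorem replay_append (init : String) (t₁ t₂ : List Event) :
    replay init (t₁ ++ t₂) = replay (replay init t₁) t₂ := by
  simp [replay, List.foldl_append]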

Load-bearing premise

The reported gains in intervention success, optimization scores, and RL performance arise from the trace and forking features rather than from unmeasured differences in experimental setup or implementation.

What would settle it

Run the same three applications with the forking and trace recording disabled while keeping every other component fixed; if the pass-rate, benchmark, and training improvements disappear, the central claim is supported.
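The arithmetic behind that ablation can also be sketched. With forking disabled, every counterfactual branch pays for a full re-run; with forking, it pays only for the suffix past the branch point. The Python toy below (the names `explore` and `branch_points` are invented for illustration) makes the wall-clock asymmetry concrete.

```python
# Hypothetical ablation sketch, not the paper's protocol: count environment
# steps needed to evaluate one counterfactual branch per branch point.

def explore(turns: int, branch_points: list[int], forking: bool) -> int:
    """Total environment steps for one rollout plus one branch per point."""
    steps = turns  # one full rollout either way
    for t in branch_points:
        if forking:
            steps += turns - t   # replay only the suffix after the fork point
        else:
            steps += turns       # no forking: rerun the whole trajectory
    return steps

with_fork = explore(turns=10, branch_points=[3, 6, 9], forking=True)
without = explore(turns=10, branch_points=[3, 6, 9], forking=False)
print(with_fork, without)  # 22 vs 40 environment steps
```

If the reported gains persist even in the `forking=False` condition, the trace is not doing the work the paper credits it with; if they vanish, the attribution holds.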

Figures

Figures reproduced from arXiv: 2605.10913 by Ananjan Nandi, Christopher D Manning, Derek Chong, Dilara Soylu, Jiuding Sun, Simon Yu, Weiyan Shi.

Figure 1
Figure 1. SHEPHERD meta-agents. Top: A supervisor meta-agent manages code repair agents. Bottom: Results from three meta-agents: (A) live supervision; (B) meta-optimization; (C) Tree GRPO. view at source ↗
Figure 2
Figure 2. Live intervention experiments on CooperBench, with Claude Haiku 4.5 as worker. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. LiveCodeBench comparison. Left: held-out test pass-rate versus optimization wallclock. Right: dev-set trajectory for each method across optimization wallclock. CRO subtask-cache reuse is reported separately in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. CRO computation reuse on LiveCodeBench rises from ∼1% on the first cold proposer session to over 60%. Setup. We evaluate on subsets of HoVer [13], MATH [8], IFBench [27], LiveCodeBench [11], and TerminalBench 2.0 (TB-2; [24]), comparing CRO against the baseline workflow, GEPA (optimizing workflow code) [2], and MetaHarness [19]. The executor is GPT-5.4-mini and meta-optimizers use GPT-5.4 (in the Codex h… view at source ↗
Figure 5
Figure 5. Trajectory compression across two worker model families and two benchmarks. The same [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. HoVer: held-out test pass-rate vs. optimization wallclock (left) and per-iteration dev-set [PITH_FULL_IMAGE:figures/full_fig_p039_6.png] view at source ↗
Figure 7
Figure 7. HoVer: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p040_7.png] view at source ↗
Figure 8
Figure 8. IFBench: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p040_8.png] view at source ↗
Figure 9
Figure 9. IFBench: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗
Figure 10
Figure 10. LiveCodeBench: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p041_10.png] view at source ↗
Figure 11
Figure 11. LiveCodeBench: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p041_11.png] view at source ↗
Figure 12
Figure 12. MATH (Level 5): test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p042_12.png] view at source ↗
Figure 13
Figure 13. MATH (Level 5): subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p042_13.png] view at source ↗
Figure 14
Figure 14. TerminalBench 2.0: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗
Figure 15
Figure 15. TerminalBench 2.0: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p043_15.png] view at source ↗
Figure 16
Figure 16. GRPO group composition over training (rows: base model; columns: setting). Tree-GRPO [PITH_FULL_IMAGE:figures/full_fig_p047_16.png] view at source ↗
Figure 17
Figure 17. Held-out Endless Terminals evaluation, sampled every 10 training steps (raw, unsmoothed). [PITH_FULL_IMAGE:figures/full_fig_p047_17.png] view at source ↗
Figure 18
Figure 18. Train raw reward (mean over G=8 roots) for both base models; panels are Qwen3.5-35B-A3B (left) and Nemotron-3-Super-120B-A12B (right). Tree-GRPO (K=4, teal) reaches higher reward than Flat GRPO (red) at every rollout step. Faint dots are observed steps from the flat-baseline run; smooth lines are denoised trajectories. view at source ↗
Figure 19
Figure 19. Early-mistake case. The wrong package name on turn 1 dooms the rest of the trajectory. [PITH_FULL_IMAGE:figures/full_fig_p048_19.png] view at source ↗
Figure 20
Figure 20. Ambiguous case. At least three turns offer plausible branches (skip-the-version-check, [PITH_FULL_IMAGE:figures/full_fig_p049_20.png] view at source ↗
Figure 21
Figure 21. Long-trajectory case. A 9-turn rollout with a wrong-file edit at turn 4 cascades into 5 [PITH_FULL_IMAGE:figures/full_fig_p049_21.png] view at source ↗
read the original abstract

We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem $5\times$ faster than Docker, achieving $>95\%$ prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. It records every agent-environment interaction as a typed event in a Git-like execution trace, enabling forking and replay of past states. The system forks agent processes and filesystems 5× faster than Docker with >95% prompt-cache reuse on replay. It demonstrates the model in three applications: runtime intervention raising pair-coding pass rates from 28.8% to 54.7% on CooperBench; counterfactual meta-optimization outperforming baselines by up to 11 points with up to 58% wall-clock reduction across four benchmarks; and Tree-RL training improving TerminalBench-2 from 34.2% to 39.4%. These results are presented as establishing Shepherd as efficient infrastructure for programming meta-agents, with the system open-sourced.

Significance. If the empirical claims hold under proper controls, Shepherd could provide a useful formalized runtime substrate for meta-agent development, leveraging execution traces for intervention, optimization, and training. The mechanization of core operations in Lean is a clear strength, supplying machine-checked proofs for the model. Open-sourcing the system is also a positive step that supports reproducibility and community follow-on work.

major comments (1)
  1. [Abstract] The abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with up to 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of Shepherd's contributions, including the Lean mechanization and open-sourcing, and for the constructive feedback on the abstract. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] The abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with up to 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'

    Authors: The abstract is intentionally concise to highlight key outcomes, following standard academic practice. The full manuscript provides the requested experimental details, including baselines, statistical tests, error bars, and methodology, in the dedicated evaluation sections for each application (runtime intervention, counterfactual meta-optimization, and Tree-RL training). These sections describe controlled experiments that isolate the contributions of the typed execution traces and forking mechanisms, supporting the attribution of the reported gains. The abstract's central claim is thus grounded in the body of the paper rather than standing alone. revision: no

Circularity Check

0 steps flagged

No circularity: empirical claims rest on reported results without derivations or self-referential reductions

full rationale

The provided abstract contains no equations, derivations, fitted parameters, or self-citations. It introduces Shepherd as a functional model with execution traces and forking, then reports three separate empirical applications (runtime intervention on CooperBench, counterfactual optimization on four benchmarks, Tree-RL on TerminalBench-2) with performance deltas. These results are presented as demonstrations rather than as outputs derived from the system's definition by construction. No load-bearing step reduces a prediction or uniqueness claim to an input fit or prior self-citation; the central infrastructure claim is supported by the listed experimental outcomes, which remain externally verifiable in principle.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to identify specific free parameters, axioms, or invented entities. The Lean mechanization likely relies on standard mathematical axioms for formal verification but none are explicitly listed.

pith-pipeline@v0.9.0 · 5479 in / 1149 out tokens · 80011 ms · 2026-05-12T03:21:05.580474+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 20 internal anchors

  1. [1]

    Managed agents (Anthropic Engineering)

    Anthropic. Managed agents. URL https://www.anthropic.com/engineering/managed-agents

  2. [2]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, July 2025. URL htt...

  3. [3]

    How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

    Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G. Dimakis, and Joseph E. Gonzalez. How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models, October 2025. URL http://arxiv.org/abs/2510.02453. arXiv:2510.02453 [cs]

  4. [4]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent llm systems fail?, 2025. URL https://arxiv.org/abs/2503.13657

  5. [5]

    Context-lite multi-turn reinforcement learning for LLM agents

    Wentse Chen, Jiayu Chen, Hao Zhu, and Jeff Schneider. Context-lite multi-turn reinforcement learning for LLM agents. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025. URL https://openreview.net/forum?id=6CE5PLsZdW

  6. [6]

    Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

    Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs, June 2024. URL https://arxiv.org/abs/2406.16218. arXiv:2406.16218

  7. [7]

    Endless Terminals: Scaling RL Environments for Terminal Agents

    Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless Terminals: Scaling RL Environments for Terminal Agents, January 2026. URL https://arxiv.org/abs/2601.16443. arXiv:2601.16443

  8. [8]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. URL https://arxiv.org/abs/2103.03874

  10. [10]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi- Agent Collaborative Framework, August 2023. URLhttps://arxiv.org/abs/2308.00352. arXiv:2308.00352

  11. [11]

    TreeRL: LLM reinforcement learning with on-policy tree search

    Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. TreeRL: LLM reinforcement learning with on-policy tree search, 2025. URL https://arxiv.org/abs/2506.11902

  12. [12]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974

  13. [13]

    Tree search for LLM agent reinforcement learning

    Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for LLM agent reinforcement learning, 2026. URL https://arxiv.org/abs/2509.21240

  14. [14]

    Hover: A dataset for many-hop fact extraction and claim verification, 2020

    Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. Hover: A dataset for many-hop fact extraction and claim verification, 2020. URL https://arxiv.org/abs/2011.03088

  15. [15]

    SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  16. [16]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, October 2023. URL https://arxiv.org/abs/2310.03714. arXiv:2310.03714

  17. [17]

    CooperBench: Why Coding Agents Cannot be Your Teammates Yet

    Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, and Diyi Yang. CooperBench: Why Coding Agents Cannot be Your Teammates Yet, January 2026. URL https://arxiv.org/abs/2601.13295. arXiv:2601.13295

  18. [18]

    Scaling Test-Time Compute for Agentic Coding

    Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling Test-Time Compute for Agentic Coding, 2026. URL https://arxiv.org/abs/2604.16529

  19. [19]

    Tinker

    Thinking Machines Lab. Tinker, 2025. URL https://thinkingmachines.ai/tinker/

  20. [20]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses, March 2026. URL https://arxiv.org/abs/2603.28052. arXiv:2603.28052

  21. [21]

    Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

    Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks, 2026. URL https://arxiv.org/abs/2604.11753

  22. [22]

    Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

    Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A. Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, and Joseph E. Gonzalez. Combee: Scaling Prompt Learning for Self-Improving Language Model Agents, April 2026. URL http://arxiv.org/abs/2604.04247. arXiv:2604.04247 [cs]

  23. [23]

    AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems

    Yang Li, Siqi Ping, Xiyu Chen, Xiaojian Qi, Zigan Wang, Ye Luo, and Xiaowei Zhang. AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems, November 2025. URL https://arxiv.org/abs/2511.00628. arXiv:2511.00628

  24. [24]

    Stop wasting your tokens: Towards efficient runtime multi-agent systems

    Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, and Tao Lin. Stop wasting your tokens: Towards efficient runtime multi-agent systems. arXiv preprint arXiv:2510.26585, 2025. URL https://arxiv.org/abs/2510.26585

  25. [25]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

  26. [26]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  27. [27]

    Nvidia Nemotron 3: Efficient and open intelligence

    NVIDIA. Nvidia Nemotron 3: Efficient and open intelligence, 2025. URL https://arxiv.org/abs/2512.20856. White Paper

  28. [28]

    Generalizing verifiable instruction following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. URL https://arxiv.org/abs/2507.02833

  30. [30]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  31. [31]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  32. [32]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning, March 2023. URL https://arxiv.org/abs/2303.11366. arXiv:2303.11366

  33. [33]

    Reinforcement learning: An introduction

    R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, 1st edition, 1998

  34. [34]

    Hindsight credit assignment for long-horizon LLM agents

    Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, and Yu-Feng Li. Hindsight credit assignment for long-horizon LLM agents, 2026. URL https://arxiv.org/abs/2603.08754

  35. [35]

    Counterfactual explanations without opening the black box: Automated decisions and the GDPR

    Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018

  36. [36]

    Fork, Explore, Commit: OS Primitives for Agentic Exploration

    Cong Wang and Yusheng Zheng. Fork, Explore, Commit: OS Primitives for Agentic Exploration, February 2026. URL https://arxiv.org/abs/2602.08199. arXiv:2602.08199

  37. [37]

    AgentSPEX: An Agent SPecification and EXecution Language

    Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, and Tong Zhang. AgentSPEX: An Agent SPecification and EXecution Language, 2026. URL https://arxiv.org/abs/2604.13346

  38. [38]

    A practitioner’s guide to multi-turn agentic reinforcement learning

    Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner’s guide to multi-turn agentic reinforcement learning, 2025. URL https://arxiv.org/abs/2510.01132

  39. [39]

    OpenHands V1: Event-sourced state management for multi-agent coding systems

    Xingyao Wang, Jiayi Pan, Binyuan Hui, et al. OpenHands V1: Event-sourced state management for multi-agent coding systems. MLSys, 2026. URL https://arxiv.org/abs/2511.03690. arXiv:2511.03690

  40. [40]

    ThetaEvolve: Test-time Learning on Open Problems

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. ThetaEvolve: Test-time Learning on Open Problems. URL https://arxiv.org/abs/2511.23473

  42. [42]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory, 2025. URL https://arxiv.org/abs/2511.20857

  43. [43]

    Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing

    Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026. URL https://arxiv.org/abs/2602.04837

  44. [44]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, August 2023. URL https://arxiv.org/abs/2308.08155. arXiv:2308.08155

  45. [45]

    Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

    Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, and Huaxiu Yao. Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning. URL https://arxiv.org/abs/2511.16043

  47. [47]

    Monte Carlo tree search boosts reasoning via iterative preference learning

    Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte Carlo tree search boosts reasoning via iterative preference learning. URL https://arxiv.org/abs/2405.00451

  49. [49]

    Recursive Language Models

    Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2025. URL https://arxiv.org/abs/2512.24601

  50. [50]

    Zhang, Zhening Li, and Omar Khattab

    Alex L. Zhang, Zhening Li, and Omar Khattab. The Mismanaged Geniuses Hypothesis, 2026. URLhttps://alexzhang13.github.io/blog/2026/mgh/. Blog post

  51. [51]

    Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

    Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?, 2025. URL https: //arxiv.org/abs/2509.03312

  52. [52]

    Darwin G

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents, 2025. URL https://arxiv.org/abs/ 2505.22954

  53. [53]

    Hyperagents.arXiv preprint arXiv:2603.19461, 2026

    Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents, March 2026. URL https://arxiv.org/abs/ 2603.19461. arXiv:2603.19461

  54. [54]

    Tomz, Christopher D

    Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity. InarXiv preprint arXiv:2510.01171, 2025. URL https://arxiv.org/abs/ 2510.01171

  55. [55]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2025. URLhttps://arxiv.org/abs/2510.04618

  56. [56]

    MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory,

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, and Muning Wen. MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory,

  57. [57]

    URL https://arxiv.org/abs/2601.03192

  58. [58]

    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models, October 2023. URL https://arxiv.org/abs/2310.04406. arXiv:2310.04406

  59. [59]

    Monte Carlo Tree Search: a review of recent modifications and applications

    Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte Carlo Tree Search: a review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2022. ISSN 1573-7462. doi: 10.1007/s10462-022-10228-y. URL http://dx.doi.org/10.1007/s10462-022-10228-y

  60. [60]

    This should be the default for the vast majority of agents on most ticks

    “none” — everything is fine, let the agent keep working. This should be the default for the vast majority of agents on most ticks. Over-intervention destroys progress

  61. [61]

    oh, the supervisor is nudging me

    “steer” — CHEAPEST intervention. The agent’s conversation is kept intact; we only append a new user message with your guidance so the agent sees it as “oh, the supervisor is nudging me”. Full conversation history and tool call context are preserved, KV cache is reused. Use this when: • the agent is broadly on task but drifting or about to make a minor wron...

  62. [62]

    redirect

    “redirect” — EXPENSIVE. The agent’s current session is aborted and a fresh opencode session starts with your guidance as message 1. The agent loses ALL memory of what it has explored, read, tried, or learned — it starts from scratch (but the files it already edited are still there on disk). Use this when: • the agent is stuck in an obvious loop (same tool,...

  63. [63]

    revert

    “revert” — EXPENSIVE and destructive. Same as redirect on the LLM side (new session, lost memory) PLUS the sandbox filesystem is rolled back to the pre-run checkpoint. All files the agent edited are discarded. Use this ONLY when: • the agent wrote files that corrupt the repo (overwrote core code with garbage, introduced unrelated changes, broke syntax) • t...
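
The four intervention types excerpted above form an escalation ladder, from free to destructive. A minimal sketch of that triage: only the four action names and their cost ordering come from the prompt text; the `choose_intervention` helper and its boolean flags are hypothetical illustration, not Shepherd's actual API.

```python
from enum import Enum


class Intervention(Enum):
    """Supervisor actions from the taxonomy above, cheapest first."""
    NONE = 0      # let the agent keep working (the default on most ticks)
    STEER = 1     # append a guidance message; history and KV cache preserved
    REDIRECT = 2  # fresh session, memory lost; edited files stay on disk
    REVERT = 3    # redirect plus filesystem rollback to the pre-run checkpoint


def choose_intervention(repo_corrupted: bool, stuck_in_loop: bool,
                        drifting: bool) -> Intervention:
    """Escalate only as far as the observed failure requires."""
    if repo_corrupted:
        return Intervention.REVERT   # files must be discarded, not just the session
    if stuck_in_loop:
        return Intervention.REDIRECT  # the conversation itself is the problem
    if drifting:
        return Intervention.STEER     # a nudge suffices; context stays intact
    return Intervention.NONE
```

The ordering encodes the prompt's warning that over-intervention destroys progress: a cheaper action is always preferred when it can plausibly fix the failure.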

  64. [64]

    (turn 1, session 0 only) Read README.md for the search policy, mechanism axes, and loop hard rules; read ORIENTATION.md for the run parameters

  65. [65]

    (turn 1, every session) Read brief.md for the session index, frontier candidate, remaining budget, and the exact pending-batch and journal-pending file names this session must produce

  66. [66]

    (working turns) Use the read-only inspection tools (§F.2.3) to inspect prior runs; pick a base reference (frontier, baseline, promoted, a run_NNN, a cand-XXX, or a Meta-Git scope ref); construct sibling variants with stage_variant, or by writing files directly under variants/session_NNN/vXX/workflow/; attach targeted_examples = {improve, protect, invariant} to each

  67. [67]

    When finish_session returns control to the host, the proposer’s LLM session terminates and no chat-history context survives across sessions

    (handoff turn) Write hypothesis_logs/session_NNN.md with ### Findings, ### Hypotheses, ### Considered & Rejected, and ### Selected Batch sections; drop a session-fragment into journal_pending/session_NNN.md; write the manifest to pending_batches/session_NNN.json; call finish_session. When finish_session returns control to the host, the proposer’s LLM session t...
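
The handoff turn above can be sketched as a small helper. It assumes only what the prompt text names (the three directories, the session_NNN naming, and the four markdown sections); `write_handoff` itself and its signature are illustrative, not part of the published tool surface.

```python
from pathlib import Path


def write_handoff(root: Path, session: int) -> None:
    """Create the three handoff artifacts a proposer session must leave behind
    before calling finish_session (hypothetical helper; paths follow the
    prompt text)."""
    n = f"session_{session:03d}"

    # Hypothesis log with the four required sections.
    logs = root / "hypothesis_logs"
    logs.mkdir(parents=True, exist_ok=True)
    (logs / f"{n}.md").write_text(
        "### Findings\n\n### Hypotheses\n\n"
        "### Considered & Rejected\n\n### Selected Batch\n"
    )

    # Session fragment for the journal.
    journal = root / "journal_pending"
    journal.mkdir(exist_ok=True)
    (journal / f"{n}.md").write_text("session fragment\n")

    # Batch manifest (empty placeholder here).
    batches = root / "pending_batches"
    batches.mkdir(exist_ok=True)
    (batches / f"{n}.json").write_text("{}\n")

    # finish_session() would be called here; control returns to the host and
    # no chat-history context survives into the next session.
```

Because no chat history crosses the session boundary, these files are the only channel through which one proposer session can inform the next.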

  68. [68]

    It is the host’s compact projection of candidates, failures, prior logs, and the handoff contract

    Read ‘brief.md‘ first. It is the host’s compact projection of candidates, failures, prior logs, and the handoff contract

  69. [69]

    Choose a set of prior sources as the batch base: ‘frontier‘, ‘baseline‘, ‘promoted‘, a run id, a candidate id, or a Meta-Git scope ref/name

  70. [70]

    Create sibling variants from that set of bases using ‘stage_variant‘, or by passing full ‘files‘/‘workflow_dir‘ entries in a batch manifest

  71. [71]

    These targeted checks are the preflight before hidden aggregate dev scoring

    Every variant must include explicit targeted examples with improve, protect, and invariant intent. These targeted checks are the preflight before hidden aggregate dev scoring

  72. [72]

    Call ‘run_counterfactual_batch‘ or ‘submit_counterfactual_batch‘

  73. [73]

    Inspect aggregate outcomes, write the required ‘experiment_logs/eXXX.md‘ with ‘## Outcome‘ and ‘## Next‘, then call ‘finish_session‘. Rules: fix failure classes, not literal train examples; preserve the Agentic task shape; use valid train ids from brief.md, candidate_catalog.json, or traces/metrics; avoid near-duplicate prompt tweaks; trea...
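
A plausible shape for the pending-batch manifest implied by these rules, with field names inferred from the excerpts above (base refs, variant workflow directories, and the improve/protect/invariant intents); the paper's actual JSON schema is not reproduced in this excerpt, so treat every key below as an assumption.

```python
import json

# Hypothetical manifest for pending_batches/session_NNN.json; field names are
# inferred from the prompt excerpts, not taken from a published schema.
manifest = {
    "base": "frontier",  # or "baseline", "promoted", a run id, a candidate id,
                         # or a Meta-Git scope ref
    "variants": [
        {
            "id": "v01",
            "workflow_dir": "variants/session_012/v01/workflow/",
            "targeted_examples": {
                "improve": ["train_041", "train_087"],  # the failure class this change should fix
                "protect": ["train_003"],               # passes that must not regress
                "invariant": ["train_019"],             # behavior that must stay identical
            },
        }
    ],
}

print(json.dumps(manifest, indent=2))
```

The three-intent split mirrors the rule that targeted checks are a preflight: a variant that fixes its improve examples but breaks a protect or invariant example fails before the hidden aggregate dev scoring is ever spent on it.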

  74. [74]

    show_history; read one trace.md of an interesting prior run; identify a failure CLASS (>=2 examples)

    ANALYZE. show_history; read one trace.md of an interesting prior run; identify a failure CLASS (>=2 examples)

  75. [75]

    Required before every batch

    HYPOTHESIZE. Required before every batch. Write hypotheses/hNNN_*.md with Branch from / Claim / Proposed change / Expected outcome / Why this differs from previous attempts / Cache consequence sections

  76. [76]

    branch(from_ref=...) to reset workflow/

    BRANCH. branch(from_ref=...) to reset workflow/

  77. [77]

    EDIT + check_workflow

  78. [78]

    Provide explicit targeted_examples for every variant

    Prefer run_counterfactual_batch with several staged siblings. Provide explicit targeted_examples for every variant

  79. [79]

    Required between runs

    OBSERVE. Required between runs. Write observations/oNNN.md

  80. [80]

    Flat dev is not a stop condition; switch to a structurally different move

    Repeat until budget exhausted. Flat dev is not a stop condition; switch to a structurally different move. F.2.3 Proposer tool surface The proposer’s tools fall into four groups: filesystem inspection and editing, ledger inspection, effect-level introspection, and host-mediated evaluation. Table 13 lists the live surface; full JSON-Schema specifications...

Showing first 80 references.
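
The ANALYZE / HYPOTHESIZE / BRANCH / EDIT / OBSERVE loop excerpted in the last entries can be sketched as a host-side skeleton. The tool names follow the prompt text (show_history, branch, check_workflow, run_counterfactual_batch, finish_session); `run_proposer` and the `tools` dict of callables are illustrative scaffolding, not the paper's implementation.

```python
def run_proposer(tools: dict, budget: int) -> list:
    """One ANALYZE -> HYPOTHESIZE -> BRANCH -> EDIT -> RUN -> OBSERVE pass per
    batch, repeated until the evaluation budget is exhausted (sketch only)."""
    log = []
    while budget > 0:
        log.append(tools["analyze"]())   # show_history; find a failure CLASS (>= 2 examples)
        tools["hypothesize"]()           # hypotheses/hNNN_*.md, required before every batch
        tools["branch"]("frontier")      # reset workflow/ from the chosen base ref
        tools["edit_and_check"]()        # EDIT + check_workflow
        budget -= tools["run_batch"]()   # staged siblings, each with targeted_examples
        tools["observe"]()               # observations/oNNN.md, required between runs
    tools["finish_session"]()            # flat dev is not a stop condition; only budget is
    return log
```

The loop deliberately has no early exit on a flat dev score: per the prompt's rule, a plateau triggers a structurally different move on the next iteration, and only budget exhaustion ends the session.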