Recognition: 1 theorem link
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3
The pith
Shepherd formalizes meta-agent operations as functions on a Git-like execution trace that records every interaction for fast forking and replay.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The substrate forks the agent process and its filesystem five times faster than Docker and reuses more than 95 percent of prompt cache on replay. When applied to runtime intervention, counterfactual meta-optimization, and Tree-RL training, the trace produces measurable gains in pass rates, benchmark scores, and training efficiency across the reported tasks.
What carries the argument
The typed execution trace that stores every agent-environment interaction as an event and supports forking of both the agent process and its filesystem.
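The trace-plus-fork idea can be sketched in a few lines of Python. This is a hedged mock, not Shepherd's API: the names Event, Trace, fork, and replay are invented here, and a real substrate would also snapshot the agent process and filesystem, not just the event log.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Event:
    kind: str       # e.g. "llm_call", "tool_call", "observation"
    payload: str

@dataclass
class Trace:
    events: List[Event] = field(default_factory=list)

    def append(self, kind: str, payload: str) -> None:
        self.events.append(Event(kind, payload))

    def fork(self, at: int) -> "Trace":
        # Branch from the state after the first `at` events, like
        # checking out a past commit and starting a new branch.
        return Trace(events=list(self.events[:at]))

def replay(trace: Trace) -> List[str]:
    # Walk the typed events in order; a real runtime would re-drive
    # the agent process and filesystem from this log.
    return [f"{e.kind}:{e.payload}" for e in trace.events]

main = Trace()
main.append("llm_call", "plan the fix")
main.append("tool_call", "ls")
main.append("observation", "src/ tests/")

branch = main.fork(at=2)            # rewind to after the second event
branch.append("tool_call", "cat src/app.py")

print(replay(branch))
```

Because a fork copies only a prefix of the event list, any past state can be branched without touching the original trace, which is the property the review's claims lean on.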
If this is right
- A live supervisor using the trace can raise pair-coding pass rates from 28.8 percent to 54.7 percent on CooperBench.
- Branching exploration inside the trace outperforms baselines on four benchmarks by as much as 11 points and reduces wall-clock time by as much as 58 percent.
- Forking rollouts at selected turns inside the trace raises TerminalBench-2 performance from 34.2 percent to 39.4 percent.
- Any past agent state captured in the trace can be replayed or branched without restarting the full environment.
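The fork-at-a-turn pattern behind the Tree-RL numbers can be pictured with a toy: from one shared prefix of turns, several continuations are explored and scored. This is an illustration only; rollout, the fork point, the actions, and the reward are all placeholders, not the paper's implementation.

```python
def rollout(prefix, action, horizon=6):
    # Placeholder policy: extend the shared prefix deterministically.
    traj = list(prefix) + [action]
    while len(traj) < horizon:
        traj.append(traj[-1] + 1)
    return traj

shared = [0, 1, 2]                                     # turns before the fork
branches = [rollout(shared, a) for a in (10, 20, 30)]  # fork into 3 rollouts

def reward(traj):
    # Toy reward: prefer trajectories ending near 25.
    return -abs(traj[-1] - 25)

best = max(branches, key=reward)
print(best)
```

The point of forking at a selected turn is that the three branches share the prefix for free; only their suffixes cost new computation.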
Where Pith is reading between the lines
- The same trace structure could let developers version and debug ordinary single-agent systems the way git versions code.
- High cache reuse on replay suggests the mechanism may scale to longer-horizon agent runs where repeated prompt computation would otherwise dominate cost.
- If the Lean mechanization of core operations is extended, it could support machine-checked proofs that certain meta-agent interventions preserve safety properties.
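The cache-reuse intuition above admits a small sketch: if a replayed prompt shares a long token prefix with the original run, only the divergent suffix needs recomputation. The accounting below is illustrative, not Shepherd's measurement.

```python
def shared_prefix_len(a, b):
    # Length of the longest common prefix of two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cache_reuse(original, replayed):
    # Fraction of the replayed prompt servable from the prefix cache.
    return shared_prefix_len(original, replayed) / len(replayed) if replayed else 0.0

original = list(range(1000))            # stand-in for 1000 prompt tokens
forked = original[:960] + [9001] * 40   # diverges at token 960

print(f"{cache_reuse(original, forked):.0%}")
```

On long-horizon runs the prefix dominates, which is why high reuse ratios become more valuable as trajectories grow.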
Load-bearing premise
The reported gains in intervention success, optimization scores, and RL performance arise from the trace and forking features rather than from unmeasured differences in experimental setup or implementation.
What would settle it
Run the same three applications with the forking and trace recording disabled while keeping every other component fixed; if the pass-rate, benchmark, and training improvements disappear, the central claim is supported.
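That ablation can be pictured as a harness in which only the trace/forking flag varies. Everything here is hypothetical: run_app stands in for the real benchmark launcher, and the +0.15 "effect of the trace" is a made-up placeholder, not a measured quantity.

```python
import random

def run_app(app, use_trace, seed):
    # Placeholder for launching one benchmark run with fixed settings.
    rng = random.Random(f"{app}-{use_trace}-{seed}")
    base = {"intervention": 0.29, "meta_opt": 0.50, "tree_rl": 0.34}[app]
    boost = 0.15 if use_trace else 0.0   # hypothesized effect of the trace
    return base + boost + rng.uniform(-0.02, 0.02)

def ablate(apps, seeds):
    # Same seeds, same apps; only the trace/forking flag changes.
    deltas = {}
    for app in apps:
        on = sum(run_app(app, True, s) for s in seeds) / len(seeds)
        off = sum(run_app(app, False, s) for s in seeds) / len(seeds)
        deltas[app] = on - off
    return deltas

deltas = ablate(["intervention", "meta_opt", "tree_rl"], seeds=range(5))
print({app: round(d, 3) for app, d in deltas.items()})
```

If the per-application deltas collapse to zero with the features disabled, the attribution fails; if they persist, the central claim survives the control.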
Figures
read the original abstract
We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem $5\times$ faster than Docker, achieving $>95\%$ prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. It records every agent-environment interaction as a typed event in a Git-like execution trace, enabling forking and replay of past states. The system forks agent processes and filesystems 5× faster than Docker with >95% prompt-cache reuse on replay. It demonstrates the model in three applications: runtime intervention raising pair-coding pass rates from 28.8% to 54.7% on CooperBench; counterfactual meta-optimization outperforming baselines by up to 11 points with up to 58% wall-clock reduction across four benchmarks; and Tree-RL training improving TerminalBench-2 from 34.2% to 39.4%. These results are presented as establishing Shepherd as efficient infrastructure for programming meta-agents, with the system open-sourced.
Significance. If the empirical claims hold under proper controls, Shepherd could provide a useful formalized runtime substrate for meta-agent development, leveraging execution traces for intervention, optimization, and training. The mechanization of core operations in Lean is a clear strength, supplying machine-checked proofs for the model. Open-sourcing the system is also a positive step that supports reproducibility and community follow-on work.
major comments (1)
- [Abstract] The abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with up to 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than to unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'
Simulated Author's Rebuttal
We thank the referee for their positive assessment of Shepherd's contributions, including the Lean mechanization and open-sourcing, and for the constructive feedback on the abstract. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] The abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with up to 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than to unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'
Authors: The abstract is intentionally concise to highlight key outcomes, following standard academic practice. The full manuscript provides the requested experimental details, including baselines, statistical tests, error bars, and methodology, in the dedicated evaluation sections for each application (runtime intervention, counterfactual meta-optimization, and Tree-RL training). These sections describe controlled experiments that isolate the contributions of the typed execution traces and forking mechanisms, supporting the attribution of the reported gains. The abstract's central claim is thus grounded in the body of the paper rather than standing alone. Revision: no
Circularity Check
No circularity: empirical claims rest on reported results without derivations or self-referential reductions
full rationale
The provided abstract contains no equations, derivations, fitted parameters, or self-citations. It introduces Shepherd as a functional model with execution traces and forking, then reports three separate empirical applications (runtime intervention on CooperBench, counterfactual optimization on four benchmarks, Tree-RL on TerminalBench-2) with performance deltas. These results are presented as demonstrations rather than as outputs derived from the system's definition by construction. No load-bearing step reduces a prediction or uniqueness claim to an input fit or prior self-citation; the central infrastructure claim is supported by the listed experimental outcomes, which remain externally verifiable in principle.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, AlexanderDuality.lean, ArithmeticFromLogic.lean · reality_from_one_distinction · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
core operations mechanized in Lean... small algebraic-effects calculus... proof envelopes... typed event in a Git-like execution trace... fork... replay
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL https://www.anthropic.com/engineering/managed-agents
-
[2]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, July 2025. URL htt...
-
[3]
Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G. Dimakis, and Joseph E. Gonzalez. How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models, October 2025. URL http://arxiv.org/abs/2510.02453. arXiv:2510.02453 [cs]
-
[4]
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent llm systems fail?, 2025. URL https://arxiv.org/abs/2503.13657
-
[5]
Context-lite multi-turn reinforcement learning for LLM agents
Wentse Chen, Jiayu Chen, Hao Zhu, and Jeff Schneider. Context-lite multi-turn reinforcement learning for LLM agents. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025. URL https://openreview.net/forum?id=6CE5PLsZdW
-
[6]
Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs, June 2024. URL https://arxiv.org/abs/2406.16218. arXiv:2406.16218
-
[7]
Endless Terminals: Scaling RL Environments for Terminal Agents
Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless Terminals: Scaling RL Environments for Terminal Agents, January 2026. URL https://arxiv.org/abs/2601.16443. arXiv:2601.16443
-
[8]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021. URL https://arxiv.org/abs/2103.03874
-
[10]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, August 2023. URL https://arxiv.org/abs/2308.00352. arXiv:2308.00352
-
[11]
TreeRL: LLM reinforcement learning with on-policy tree search, 2025
Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. TreeRL: LLM reinforcement learning with on-policy tree search, 2025. URL https://arxiv.org/abs/2506.11902
-
[12]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974
-
[13]
Tree search for llm agent reinforcement learning, 2026
Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for llm agent reinforcement learning, 2026. URL https://arxiv.org/abs/2509.21240
-
[14]
Hover: A dataset for many-hop fact extraction and claim verification, 2020
Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. Hover: A dataset for many-hop fact extraction and claim verification, 2020. URL https://arxiv.org/abs/2011.03088
-
[15]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66
-
[16]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, October 2023. URL https://arxiv.org/abs/2310.03714. arXiv:2310.03714
-
[17]
CooperBench: Why Coding Agents Cannot be Your Teammates Yet
Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, and Diyi Yang. CooperBench: Why Coding Agents Cannot be Your Teammates Yet, January 2026. URL https://arxiv.org/abs/2601.13295. arXiv:2601.13295
-
[18]
Scaling Test-Time Compute for Agentic Coding
Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, and Anirudh Goyal. Scaling Test-Time Compute for Agentic Coding, 2026. URL https://arxiv.org/abs/2604.16529
-
[19]
Thinking Machines Lab. Tinker, 2025. URL https://thinkingmachines.ai/tinker/
-
[20]
Meta-Harness: End-to-End Optimization of Model Harnesses
Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses, March 2026. URL https://arxiv.org/abs/2603.28052. arXiv:2603.28052
-
[21]
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks, 2026. URL https://arxiv.org/abs/2604.11753
-
[22]
Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A. Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, and Joseph E. Gonzalez. Combee: Scaling Prompt Learning for Self-Improving Language Model Agents, April 2026. URL http://arxiv.org/abs/2604.04247. arXiv:2604.04247 [cs]
-
[23]
Yang Li, Siqi Ping, Xiyu Chen, Xiaojian Qi, Zigan Wang, Ye Luo, and Xiaowei Zhang. AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems, November 2025. URL https://arxiv.org/abs/2511.00628. arXiv:2511.00628
-
[24]
Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, and Tao Lin. Stop wasting your tokens: Towards efficient runtime multi-agent systems. arXiv preprint arXiv:2510.26585, 2025. URL https://arxiv.org/abs/2510.26585
-
[25]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...
-
[26]
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...
-
[27]
Nvidia nemotron 3: Efficient and open intelligence, 2025
NVIDIA. Nvidia nemotron 3: Efficient and open intelligence, 2025. URL https://arxiv.org/abs/2512.20856. White Paper
-
[28]
Generalizing verifiable instruction following,
Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following,
-
[30]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
-
[31]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
-
[32]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning, March 2023. URL https://arxiv.org/abs/2303.11366. arXiv:2303.11366
-
[33]
Reinforcement learning: An introduction, 1st edition
RS Sutton and AG Barto. Reinforcement learning: An introduction, 1st edition. Exp. Psychol. Learn. Mem. Cogn, 30:1302–1321, 1998
-
[34]
Hindsight credit assignment for long-horizon llm agents, 2026
Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, and Yu-Feng Li. Hindsight credit assignment for long-horizon llm agents, 2026. URL https://arxiv.org/abs/2603.08754
-
[35]
Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harvard Journal of Law & Technology, 31(2):841–887, 2018
-
[36]
Fork, Explore, Commit: OS Primitives for Agentic Exploration, February 2026
Cong Wang and Yusheng Zheng. Fork, Explore, Commit: OS Primitives for Agentic Exploration, February 2026. URL https://arxiv.org/abs/2602.08199. arXiv:2602.08199
-
[37]
AgentSPEX: An Agent SPecification and EXecution Language
Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, and Tong Zhang. AgentSPEX: An Agent SPecification and EXecution Language, 2026. URL https://arxiv.org/abs/2604.13346
-
[38]
A practitioner’s guide to multi-turn agentic reinforcement learning
Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner’s guide to multi-turn agentic reinforcement learning, 2025. URL https://arxiv.org/abs/2510.01132
-
[39]
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
Xingyao Wang, Jiayi Pan, Binyuan Hui, et al. OpenHands V1: Event-sourced state management for multi-agent coding systems. MLSys, 2026. URL https://arxiv.org/abs/2511.03690. arXiv:2511.03690
-
[40]
ThetaEvolve: Test-time Learning on Open Problems,
Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. ThetaEvolve: Test-time Learning on Open Problems,
-
[42]
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory, 2025. URL https://arxiv.org/abs/2511.20857
-
[43]
Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026
Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026. URL https://arxiv.org/abs/2602.04837
-
[44]
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, August 2023. URL https://arxiv.org/abs/2308.08155. arXiv:2308.08155
-
[45]
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning,
Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, and Huaxiu Yao. Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning,
-
[47]
Monte carlo tree search boosts reasoning via iterative preference learning
Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning,
-
[49]
Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2025. URL https://arxiv.org/abs/2512.24601
-
[50]
The Mismanaged Geniuses Hypothesis
Alex L. Zhang, Zhening Li, and Omar Khattab. The Mismanaged Geniuses Hypothesis, 2026. URL https://alexzhang13.github.io/blog/2026/mgh/. Blog post
-
[51]
Agentracer: Who is inducing failure in the llm agentic systems?
Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems? arXiv preprint arXiv:2509.03312, 2025. URL https://arxiv.org/abs/2509.03312
-
[53]
Hyperagents
Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents, March 2026. URL https://arxiv.org/abs/2603.19461. arXiv:2603.19461
-
[54]
Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity. arXiv preprint arXiv:2510.01171, 2025. URL https://arxiv.org/abs/2510.01171
-
[55]
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2025. URL https://arxiv.org/abs/2510.04618
-
[56]
MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory,
Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, and Muning Wen. MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory,
-
[58]
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models, October 2023. URL https://arxiv.org/abs/2310.04406. arXiv:2310.04406
-
[59]
Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte carlo tree search: a review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2022. ISSN 1573-7462. doi: 10.1007/s10462-022-10228-y. URL http://dx.doi.org/10.1007/s10462-022-10228-y
-
[60]
This should be the default for the vast majority of agents on most ticks
“none” — everything is fine, let the agent keep working. This should be the default for the vast majority of agents on most ticks. Over-intervention destroys progress
-
[61]
oh, the supervisor is nudging me
“steer” — CHEAPEST intervention. The agent’s conversation is kept intact; we only append a new user message with your guidance so the agent sees it as “oh, the supervisor is nudging me”. Full conversation history and tool call context are preserved, KV cache is reused. Use this when: • the agent is broadly on task but drifting or about to make a minor wron...
-
[62]
“redirect” — EXPENSIVE. The agent’s current session is aborted and a fresh opencode session starts with your guidance as message 1. The agent loses ALL memory of what it has explored, read, tried, or learned — it starts from scratch (but the files it already edited are still there on disk). Use this when: • the agent is stuck in an obvious loop (same tool,...
-
[63]
“revert” — EXPENSIVE and destructive. Same as redirect on the LLM side (new session, lost memory) PLUS the sandbox filesystem is rolled back to the pre-run checkpoint. All files the agent edited are discarded. Use this ONLY when: • the agent wrote files that corrupt the repo (overwrote core code with garbage, introduced unrelated changes, broke syntax) • t...
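Taken together, the four intervention types quoted above form a small escalation ladder: none, steer, redirect, revert. A hedged sketch of the dispatch follows; the state shape and the function name are invented here for illustration, while the real system operates on live sessions and sandbox filesystems.

```python
def intervene(kind, state, guidance=None, checkpoint=None):
    # Illustrative escalation ladder: none < steer < redirect < revert.
    if kind == "none":
        return state                                      # agent keeps working
    if kind == "steer":
        # cheapest: history and KV cache preserved, one message appended
        return {**state, "messages": state["messages"] + [("user", guidance)]}
    if kind == "redirect":
        # fresh session: memory lost, edited files stay on disk
        return {"messages": [("user", guidance)], "fs": state["fs"]}
    if kind == "revert":
        # redirect plus rollback of the sandbox filesystem
        return {"messages": [("user", guidance)], "fs": dict(checkpoint)}
    raise ValueError(f"unknown intervention: {kind}")

state = {"messages": [("user", "fix the bug")], "fs": {"app.py": "corrupted"}}
ckpt = {"app.py": "original"}

steered = intervene("steer", state, guidance="run the tests first")
reverted = intervene("revert", state, guidance="start over", checkpoint=ckpt)

print(len(steered["messages"]), reverted["fs"]["app.py"])
```

The ordering matters: steer preserves the whole conversation and cache, redirect sacrifices memory but keeps disk state, and revert discards both back to the checkpoint.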
-
[64]
(turn 1, session 0 only) Read README.md for the search policy, mechanism axes, and loop hard rules; read ORIENTATION.md for the run parameters
-
[65]
(turn 1, every session) Read brief.md for the session index, frontier candidate, remaining budget, and the exact pending-batch and journal-pending file names this session must produce
-
[66]
(working turns) Use the read-only inspection tools (§F.2.3) to inspect prior runs; pick a base reference (frontier, baseline, promoted, a run_NNN, a cand-XXX, or a Meta-Git scope ref); construct sibling variants with stage_variant, or by writing files directly under variants/session_NNN/vXX/workflow/; attach targeted_examples = {improve, protect, invariant} to each
-
[67]
(handoff turn) Write hypothesis_logs/session_NNN.md with ### Findings, ### Hypotheses, ### Considered & Rejected, and ### Selected Batch sections; drop a session-fragment into journal_pending/session_NNN.md; write the manifest to pending_batches/session_NNN.json; call finish_session. When finish_session returns control to the host, the proposer’s LLM session t...
-
[68]
It is the host’s compact projection of candidates, failures, prior logs, and the handoff contract
Read ‘brief.md‘ first. It is the host’s compact projection of candidates, failures, prior logs, and the handoff contract
-
[69]
Choose a set of prior sources as the batch base: ‘frontier‘, ‘baseline‘, ‘promoted‘, a run id, a candidate id, or a Meta-Git scope ref/name
-
[70]
Create sibling variants from that set of bases using ‘stage_variant‘, or by passing full ‘files‘/‘workflow_dir‘ entries in a batch manifest
-
[71]
These targeted checks are the preflight before hidden aggregate dev scoring
Every variant must include explicit targeted examples with improve, protect, and invariant intent. These targeted checks are the preflight before hidden aggregate dev scoring
-
[72]
Call ‘run_counterfactual_batch‘ or ‘submit_counterfactual_batch‘
-
[73]
Inspect aggregate outcomes, write the required ‘experiment_logs/eXXX.md‘ with ‘## Outcome‘ and ‘## Next‘, then call ‘finish_session‘. Rules: fix failure classes, not literal train examples; preserve the Agentic task shape; use valid train ids from brief.md, candidate_catalog.json, or traces/metrics; avoid near-duplicate prompt tweaks; trea...
-
[74]
ANALYZE. show_history; read one trace.md of an interesting prior run; identify a failure CLASS (>=2 examples)
-
[75]
HYPOTHESIZE. Required before every batch. Write hypotheses/hNNN_*.md with Branch from / Claim / Proposed change / Expected outcome / Why this differs from previous attempts / Cache consequence sections
-
[77]
EDIT + check_workflow
-
[78]
Provide explicit targeted_examples for every variant
Prefer run_counterfactual_batch with several staged siblings. Provide explicit targeted_examples for every variant
-
[80]
Flat dev is not a stop condition; switch to a structurally different move
Repeat until budget exhausted. Flat dev is not a stop condition; switch to a structurally different move. F.2.3 Proposer tool surface The proposer’s tools fall into four groups: filesystem inspection and editing, ledger inspection, effect-level introspection, and host-mediated evaluation. Table 13 lists the live surface; full JSON-Schema specifications...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.