MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Jun Song; Qianshu Cai; Wei Xue; Xianzhang Jia; Xinmei Tian; Yike Guo; Yonggang Zhang

arxiv: 2605.22794 · v1 · pith:IYUTICPPnew · submitted 2026-05-21 · 💻 cs.AI · cs.LG

Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

Pith reviewed 2026-05-22 04:43 UTC · model grok-4.3

keywords: self-evolving agents, source-level rewriting, autonomous agent systems, failure-driven evolution, code self-modification, agent performance improvement, OpenClaw

The pith

Source-level rewriting lets autonomous agents fix structural failures and raise performance without human updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current autonomous agents remain static after deployment because they can only modify text artifacts like prompts and skills, leaving code structures such as routing and dispatch untouched. MOSS shows that allowing agents to rewrite their source code provides a more general way to evolve, since code is Turing-complete and changes take effect deterministically. The system curates batches of production failures, delegates code edits to an external coding agent, verifies the changes by replaying the failures in temporary environments, and then swaps in the new version with safeguards. This approach matters to a sympathetic reader because it could let deployed agents learn from their mistakes and reduce recurring problems without waiting for human programmers. If the claim holds, agent systems could become more reliable and adaptive over time through repeated self-improvement cycles.

Core claim

MOSS performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.
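The stages named in the core claim can be sketched as a small orchestrator. This is a minimal Python sketch under stated assumptions, not the authors' implementation: `run_cycle`, `Failure`, `Verdict`, and the four stage names are hypothetical, and the three callables stand in for the external coding-agent CLI, the ephemeral trial workers, and the user-consent prompt. The paper's actual stage list (the Figure 3 caption mentions a Plan-Review stage) is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Every name below is hypothetical; the abstract does not specify an API.


@dataclass(frozen=True)
class Failure:
    task_id: str
    trace: str


@dataclass(frozen=True)
class Verdict:
    stage: str
    passed: bool


def run_cycle(
    failures: List[Failure],
    edit_source: Callable[[List[Failure]], str],  # stand-in for the coding-agent CLI
    replay: Callable[[str, Failure], bool],       # stand-in for one trial-worker replay
    user_consents: Callable[[str], bool],         # stand-in for the consent gate
) -> List[Verdict]:
    """Run one cycle. The orchestrator, not the editor, owns ordering and verdicts."""
    verdicts: List[Verdict] = []

    # Stage "curate": an empty batch means there is no evidence to anchor to.
    verdicts.append(Verdict("curate", bool(failures)))
    if not verdicts[-1].passed:
        return verdicts

    # Stage "edit": the pluggable editor returns a candidate image identifier.
    candidate = edit_source(failures)
    verdicts.append(Verdict("edit", bool(candidate)))
    if not verdicts[-1].passed:
        return verdicts

    # Stage "verify": every failure in the batch must pass on the candidate.
    verified = all(replay(candidate, failure) for failure in failures)
    verdicts.append(Verdict("verify", verified))
    if not verified:
        return verdicts

    # Stage "promote": promotion is recorded only if the user consents.
    verdicts.append(Verdict("promote", user_consents(candidate)))
    return verdicts
```

Because the editor is passed in as a plain callable, a different coding agent can be swapped in without changing the stage order or the pass/fail logic, which is the separation the core claim describes.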

What carries the argument

Deterministic multi-stage pipeline for anchoring source code modifications to failure evidence, with replay verification and gated deployment.

Load-bearing premise

The external coding-agent CLI produces modifications that pass replay verification on the failure batch without introducing undetected regressions or breaking unrelated functionality.

What would settle it

Applying MOSS to OpenClaw or a similar system and measuring whether the four-task mean grader score rises after one cycle while checking for regressions in other functions.
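The reported aggregate change is 0.61 - 0.25 = 0.36 in mean grader score. A replication would compute the same quantity from per-task scores, which the abstract does not list. The Python sketch below shows that computation only; `mean_lift` is a hypothetical helper, and the task names and scores in the usage lines are illustrative placeholders, not values from the paper.

```python
from statistics import mean
from typing import Dict


def mean_lift(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Return mean(after) - mean(before) over an identical set of tasks."""
    if before.keys() != after.keys():
        raise ValueError("before and after must score the same tasks")
    return mean(after.values()) - mean(before.values())


# Placeholder scores for four hypothetical tasks (NOT the paper's data).
before = {"task_a": 0.5, "task_b": 0.5, "task_c": 0.0, "task_d": 1.0}
after = {"task_a": 1.0, "task_b": 0.5, "task_c": 0.5, "task_d": 1.0}
lift = mean_lift(before, after)
```

Requiring identical task sets guards against a lift that comes from dropping hard tasks rather than from the rewrite itself.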

Figures

Figures reproduced from arXiv: 2605.22794 by Jun Song, Qianshu Cai, Wei Xue, Xianzhang Jia, Xinmei Tian, Yike Guo, Yonggang Zhang.

**Figure 2.** The four nested levels of MOSS evolution: a pre-loop baseline (Layer 0), an iteration loop

**Figure 3.** Iteration-1 trace covering all stages MOSS executed; stage 3 (Plan-Review) and stage 5

Original abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.
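The promotion step in the abstract (consent gate, in-place swap, health-probe-gated rollback) can be illustrated with a short Python sketch. It rests on assumptions the abstract does not confirm: `promote_with_rollback`, the `swap` and `health_probe` callables, and the fixed probe count are all hypothetical, and a real deployment would call a container runtime rather than plain functions.

```python
from typing import Callable


def promote_with_rollback(
    current_image: str,
    candidate_image: str,
    consent: Callable[[str], bool],   # hypothetical user-consent gate
    swap: Callable[[str], None],      # hypothetical in-place container swap
    health_probe: Callable[[], bool], # hypothetical probe of the live container
    probes: int = 3,                  # assumed number of probes before accepting
) -> str:
    """Attempt a promotion and return the image that is live afterwards."""
    # Without consent nothing is swapped and the current image stays live.
    if not consent(candidate_image):
        return current_image

    swap(candidate_image)

    # Any failed probe triggers an immediate swap back to the previous image.
    for _ in range(probes):
        if not health_probe():
            swap(current_image)
            return current_image

    return candidate_image
```

Returning the live image makes the outcome explicit to the caller: a declined or rolled-back promotion is indistinguishable from "nothing changed", which is the intended safe default in this sketch.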

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MOSS makes a case for letting agents rewrite their own source code to fix issues text prompts cannot reach, but the verification step only replays the failure batch and leaves regression risk unaddressed.

The main thing to know is that this paper argues source-level rewriting gives agents a genuinely broader way to self-evolve than the text-only methods in prior work. Code changes can alter routing, hooks, and invariants that live outside any prompt or skill file, and the authors treat this as a strict superset because it is Turing-complete and deterministic rather than dependent on model compliance.

They describe a pipeline that pulls a batch of production failures, hands the edit task to an external coding-agent CLI, replays the batch in trial workers to check the candidate, and then does an in-place container swap with health-probe rollback. The OpenClaw result shows the mean grader score moving from 0.25 to 0.61 after one cycle with no human input in between. That framing and the concrete stages are the clearest contribution. The work is honest about the limitation of current self-evolving agents and gives a practical way to move past it.

The soft spot is exactly where the stress test points: replaying only the curated failure batch does not logically guarantee that unrelated tasks or state invariants stay intact. Source edits can change ordering or dispatch in ways that only show up later, and the description does not mention a broader regression suite, differential testing against the old image, or formal checks. Without those, the reported lift is hard to trust as stable.

The paper is aimed at people building long-running autonomous agents who already see recurring structural failures. Readers working on self-modifying systems or reliable deployment would find the distinction useful even if they end up disagreeing with the safety claims. It is worth sending to a serious referee because the core idea is grounded enough to deserve detailed review and the authors have shipped a working pipeline rather than just a sketch.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces MOSS, a system enabling self-evolution in autonomous agent systems through source-level code rewriting rather than limiting changes to text-mutable artifacts such as prompts or skill files. It describes a deterministic multi-stage pipeline that curates production failure evidence, delegates modifications to a pluggable external coding-agent CLI, verifies candidates by replaying the failure batch in ephemeral workers, and promotes successful edits via user-consent-gated container swaps with health-probe rollback. The central empirical claim is that MOSS raises the mean grader score on four OpenClaw tasks from 0.25 to 0.61 after a single evolution cycle without human intervention.

Significance. If the reported performance improvement is shown to be robust, the work would be significant for autonomous agent research by demonstrating a practical route to structural self-adaptation at the code level. Source rewriting is positioned as a strict superset of text-based methods, offering deterministic effects and Turing completeness that avoid base-model compliance issues. The design choice to retain stage ordering and verdicts while outsourcing edits to an external CLI is a pragmatic strength that supports pluggability.

major comments (1)

Verification and promotion pipeline: The safety argument for promoting edits rests on replay verification against the automatically curated failure batch alone. Because source-level changes can alter routing, hook ordering, and state invariants that affect tasks outside the batch, a passing replay on the failure set does not logically entail absence of regressions on the broader task distribution. No independent regression suite, differential testing on held-out tasks, or invariant check is described, which directly bears on the claim of reliable human-intervention-free evolution.
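The missing check this comment asks for can be made concrete. The Python sketch below shows one possible form of differential testing on held-out tasks: score each task on the old image and on the candidate, and block promotion if any score drops. It is not part of the described system; `differential_regressions`, `safe_to_promote`, the scoring callables, and the `tolerance` parameter are hypothetical.

```python
from typing import Callable, Iterable, List


def differential_regressions(
    held_out: Iterable[str],
    score_old: Callable[[str], float],  # hypothetical grader run on the old image
    score_new: Callable[[str], float],  # hypothetical grader run on the candidate
    tolerance: float = 0.0,             # allowed score drop before flagging
) -> List[str]:
    """Return held-out task ids whose score dropped by more than `tolerance`."""
    return [
        task
        for task in held_out
        if score_new(task) < score_old(task) - tolerance
    ]


def safe_to_promote(batch_passed: bool, regressed: List[str]) -> bool:
    """Promote only if the failure batch passed AND no held-out task regressed."""
    return batch_passed and not regressed
```

Under this gate a passing replay on the failure batch becomes necessary but not sufficient, which is the logical gap the comment identifies.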

minor comments (2)

The abstract and pipeline description would benefit from explicit enumeration of the health probes used for rollback gating and the precise definition of the grader score metric on OpenClaw.
Add a brief comparison table or baseline description showing how the 0.25 starting score was obtained and whether any controls for prompt-only evolution were run in the same setting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we intend to incorporate.

Referee: Verification and promotion pipeline: The safety argument for promoting edits rests on replay verification against the automatically curated failure batch alone. Because source-level changes can alter routing, hook ordering, and state invariants that affect tasks outside the batch, a passing replay on the failure set does not logically entail absence of regressions on the broader task distribution. No independent regression suite, differential testing on held-out tasks, or invariant check is described, which directly bears on the claim of reliable human-intervention-free evolution.

Authors: We agree that verification against the failure batch alone does not guarantee the absence of regressions on the broader task distribution, since source-level edits can affect routing, hook ordering, and state invariants. Our design focuses on production-derived failure evidence to target observed issues, with user-consent gating and health-probe rollback providing additional safeguards during promotion. To address this point, we will revise the manuscript to explicitly discuss the scope of the verification and include results from differential testing on held-out tasks.

Revision: yes

Circularity Check

0 steps flagged

Empirical pipeline description with no mathematical derivations or self-referential reductions

The manuscript presents MOSS as an empirical system for source-level self-rewriting in agentic substrates, anchored to failure batches and verified via replay in ephemeral workers. The reported lift from 0.25 to 0.61 on OpenClaw tasks is framed as the measured outcome of this pipeline rather than a derived prediction or first-principles result. No equations, fitted parameters, uniqueness theorems, or ansatzes appear that could reduce to their own inputs by construction. The central claim therefore remains an externally falsifiable empirical observation against the stated benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all technical details remain unspecified.

pith-pipeline@v0.9.0 · 5811 in / 901 out tokens · 35986 ms · 2026-05-22T04:43:40.400578+00:00 · methodology

