MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Huajiang Zheng; Jun Song; Qianshu Cai; Wei Xue; Xianzhang Jia; Xinmei Tian; Yike Guo; Yonggang Zhang

Review: 2 major objections · 1 minor · cited by 1

Autonomous agents can rewrite their own source code to fix structural failures that text changes cannot reach.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 16:42 UTC pith:IYUTICPP

Load-bearing objection: MOSS has a concrete pipeline for source-level agent self-rewriting with container swaps, but the verification only replays past failures and the single reported result has no controls or baselines.

arxiv 2605.22794 v2 pith:IYUTICPP submitted 2026-05-21 cs.AI cs.LG

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Huajiang Zheng, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

classification cs.AI cs.LG

keywords: self-evolving agents · source code rewriting · autonomous agent systems · failure-driven adaptation · container deployment · agentic self-improvement

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current self-evolving agents modify only text artifacts such as prompts and workflows, leaving code-level elements like routing, hooks, and state invariants unreachable. MOSS performs evolution directly at the source level through a deterministic pipeline that curates failure evidence, delegates modifications to an external coding tool, verifies candidates by replaying failures on trial workers, and promotes changes with rollback safeguards. The approach is presented as Turing-complete and deterministic, independent of base-model compliance or context length. On the OpenClaw benchmark it raises mean grader score from 0.25 to 0.61 after one cycle with no human intervention. If the claim holds, recurring structural failures become fixable without waiting for external updates.

Core claim

MOSS performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback.
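
The stage ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Failure`, `CycleResult`, `run_cycle`, and the injected callables `coding_agent`, `replay`, `user_consents`, and `swap` are all hypothetical names, and the real pipeline has more stages than shown here. What the sketch tries to capture is that the orchestrator owns the order of stages and every verdict, while the code edit itself is a pluggable callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


# All names here are hypothetical; the paper does not publish this interface.
@dataclass(frozen=True)
class Failure:
    task_id: str
    trace: str


@dataclass
class CycleResult:
    stages_run: List[str]
    promoted: bool


def run_cycle(
    failures: Sequence[Failure],
    coding_agent: Callable[[Sequence[Failure]], str],  # returns a candidate image tag
    replay: Callable[[str, Failure], bool],            # True if the failure is fixed
    user_consents: Callable[[str], bool],
    swap: Callable[[str], bool],                       # True if the swap stayed healthy
) -> CycleResult:
    """Run one evolution cycle with a fixed stage order and early exits."""
    stages = ["curate"]
    if not failures:
        return CycleResult(stages, promoted=False)

    stages.append("rewrite")
    candidate = coding_agent(failures)  # the edit is delegated, the verdict is not

    stages.append("verify")
    if not all(replay(candidate, f) for f in failures):
        return CycleResult(stages, promoted=False)

    stages.append("consent")
    if not user_consents(candidate):
        return CycleResult(stages, promoted=False)

    stages.append("promote")
    return CycleResult(stages, promoted=swap(candidate))


# Usage with stub callables standing in for the real components.
result = run_cycle(
    failures=[Failure("t1", "routing error")],
    coding_agent=lambda batch: "agent:candidate",
    replay=lambda image, failure: True,
    user_consents=lambda image: True,
    swap=lambda image: True,
)
```

Because each gate returns early, a candidate that fails replay or is refused by the user never reaches the swap step in this sketch.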

What carries the argument

Source-level rewriting pipeline anchored to failure batches, verified on ephemeral trial workers, and promoted with consent-gated container swap.

Load-bearing premise

Replaying a curated batch of past failures on ephemeral trial workers is sufficient to certify that a source change will not introduce new structural failures during live operation.
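
A toy example may help show why this premise is load-bearing. The routers, task names, and expected routes below are invented for illustration and do not come from the paper; the sketch only demonstrates that a rewrite can pass replay on the failure batch while regressing on a task that previously worked.

```python
from typing import Callable, Dict, Iterable

# Hypothetical ground truth for which handler each task type should reach.
EXPECTED: Dict[str, str] = {"search": "web", "email": "mail", "calendar": "cal"}


def original_router(task: str) -> str:
    # Hypothetical pre-rewrite behaviour: "search" is misrouted.
    return {"search": "shell", "email": "mail", "calendar": "cal"}[task]


def candidate_router(task: str) -> str:
    # Hypothetical rewrite: fixes "search" but silently breaks "calendar".
    return {"search": "web", "email": "mail", "calendar": "mail"}[task]


def passes(router: Callable[[str], str], tasks: Iterable[str]) -> bool:
    """True if the router sends every listed task to its expected handler."""
    return all(router(t) == EXPECTED[t] for t in tasks)


failure_batch = ["search"]              # only tasks that failed in production
regression_set = ["email", "calendar"]  # tasks that used to succeed

replay_ok = passes(candidate_router, failure_batch)
regression_ok = passes(candidate_router, regression_set)
```

Here `replay_ok` is true and `regression_ok` is false: the candidate would be certified by failure replay alone even though it introduces a new routing error.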

What would settle it

Live deployment of a promoted change that produces new structural failures absent from the original system on the same tasks.

If this is right

Structural failures in routing, hook ordering, and dispatch become reachable for autonomous repair.
Evolution occurs deterministically through code rather than probabilistic text generation.
Changes remain stable under long-context drift because they reside in source rather than prompts.
A single cycle of failure-driven rewriting can raise task performance without external intervention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be applied to other containerized agent systems by swapping the coding CLI.
Hybrid use with text-layer methods might address both code and prompt-level issues in one loop.
Repeated cycles could accumulate improvements that compound across multiple failure types.
Rollback mechanisms might need extension if verification batches miss rare edge cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

MOSS has a concrete pipeline for source-level agent self-rewriting with container swaps, but the verification only replays past failures and the single reported result has no controls or baselines.

Hi colleague,

The main thing to know about MOSS is that it builds a deterministic pipeline for agents to edit their own source code based on production failures, then promotes the change via container swap after replay verification. This is a step past the usual prompt or skill file tweaks.

The paper does a clean job laying out why source-level changes matter: they can reach routing, hooks, and invariants that text artifacts cannot touch, and the changes are deterministic rather than dependent on model compliance. The architecture keeps MOSS in charge of stages and verdicts while delegating the actual code edit to an external coding CLI, and the ephemeral trial workers plus health-probe rollback are practical choices for deployment.

The soft spots sit in the evidence and the verification claim. The abstract reports one lift on OpenClaw from 0.25 to 0.61 but gives no variance, multiple runs, or text-only baseline, so the size of the effect is hard to judge. More importantly, the verification replays only a curated batch of prior failures; that leaves any new structural problems created by the rewrite untested, and the paper does not add static checks or broader testing to close the gap. The stress-test note on this point holds up from the abstract.

This is for people working on long-running agent systems who want to explore deeper self-modification. It has enough of a system description and a focused claim to deserve peer review, though it will need stronger experiments and a tighter argument on safety before the central result lands.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MOSS, a system enabling self-evolution in autonomous agent systems via source-level code rewriting rather than limiting changes to text-based artifacts like prompts or workflows. The approach involves curating batches of production failures, delegating code modifications to an external coding agent, verifying candidates by replaying failures in ephemeral workers, and promoting changes with user consent and rollback mechanisms. The central empirical result is an increase in the four-task mean grader score on the OpenClaw benchmark from 0.25 to 0.61 following a single evolution cycle without human intervention.

Significance. If the empirical result and the safety of the source-level edits can be substantiated, this work would be significant for the field of autonomous agents by addressing a class of structural failures that text-mutable methods cannot reach. The argument that source-level adaptation is Turing-complete and deterministic is a conceptual strength. The use of a pluggable external coding agent and health-probe-gated rollback are practical contributions. However, the current presentation provides insufficient detail to evaluate these claims.

major comments (2)

[Abstract] The claim of lifting the mean grader score from 0.25 to 0.61 is presented without controls, variance estimates, multiple independent runs, or comparisons to text-only baselines, rendering the central empirical result unevaluable from the provided information.
[Multi-stage pipeline] Verification by replaying a finite batch of past failures on ephemeral trial workers does not address the possibility of new structural failures introduced by the source rewrite (e.g., altered hook ordering or state invariants outside the replayed scenarios). This is load-bearing for the assertion that the edit is safe for promotion.

minor comments (1)

[Abstract] The four tasks comprising the OpenClaw mean and the definition of the grader are not specified; stating them would aid interpretation of the reported scores.
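
The first major comment can be made concrete with a little arithmetic. The two scores below are the ones reported in the abstract; the helper `summarize` is a hypothetical name, and the sketch only shows that a spread estimate needs at least two independent runs.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple

before, after = 0.25, 0.61    # four-task mean grader scores from the abstract
abs_gain = after - before     # absolute lift in mean score
rel_gain = abs_gain / before  # lift relative to the starting score


def summarize(run_means: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Mean and sample standard deviation over independent cycle runs.

    The standard deviation is None when fewer than two runs are available,
    since a sample spread cannot be estimated from a single observation.
    """
    spread = stdev(run_means) if len(run_means) > 1 else None
    return mean(run_means), spread


single_run = summarize([after])
```

The absolute lift is about 0.36 and the relative lift about 144%, but `single_run` carries no spread, which is why the size of the effect is hard to judge from one cycle.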

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below and indicate the changes we will make.

Referee: [Abstract] The claim of lifting the mean grader score from 0.25 to 0.61 is presented without controls, variance estimates, multiple independent runs, or comparisons to text-only baselines, rendering the central empirical result unevaluable from the provided information.

Authors: We agree the abstract requires more context. In revision we will add a reference to the experimental protocol in Section 4, include a text-only baseline comparison, and explicitly note that the reported result is from a single production cycle without variance estimates from repeated independent runs. The latter cannot be supplied from the existing data.
revision: partial

Referee: [Multi-stage pipeline] Verification by replaying a finite batch of past failures on ephemeral trial workers does not address the possibility of new structural failures introduced by the source rewrite (e.g., altered hook ordering or state invariants outside the replayed scenarios). This is load-bearing for the assertion that the edit is safe for promotion.

Authors: This observation is correct. Replay verification only confirms behavior on the curated failure batch. The promotion step relies on the health-probe-gated rollback to handle any new structural issues post-deployment. We will expand the pipeline description to state this limitation explicitly and clarify the role of rollback in the safety argument.
revision: yes
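
The role the rebuttal assigns to rollback can be sketched as below. This is an assumption-laden illustration rather than the MOSS mechanism: `Deployment`, `swap_with_rollback`, the `probe` callable, and the `attempts` count are hypothetical, and a real container swap would go through the container runtime rather than an attribute assignment.

```python
from typing import Callable


class Deployment:
    """Hypothetical stand-in for the live container slot."""

    def __init__(self, image: str) -> None:
        self.image = image


def swap_with_rollback(
    dep: Deployment,
    candidate: str,
    probe: Callable[[str], bool],
    attempts: int = 3,
) -> bool:
    """Swap in the candidate image, restoring the old one if any probe fails."""
    previous = dep.image
    dep.image = candidate          # in-place swap
    for _ in range(attempts):
        if not probe(dep.image):   # health probe gates the promotion
            dep.image = previous   # roll back to the last known-good image
            return False
    return True


# Usage: a healthy candidate stays, an unhealthy one is rolled back.
healthy = Deployment("agent:v1")
kept = swap_with_rollback(healthy, "agent:v2", probe=lambda image: True)

unhealthy = Deployment("agent:v1")
reverted = swap_with_rollback(unhealthy, "agent:v2", probe=lambda image: False)
```

Rollback in this sketch fires only on what the probe observes, so a structural regression that leaves the probe green would stay in production; that is the limitation the referee's second comment points at.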

Circularity Check

0 steps flagged

No significant circularity; empirical result is direct measurement

The paper reports a measured performance lift (0.25 to 0.61 on OpenClaw) from executing the MOSS pipeline on production failures, with verification via replay on a curated batch. No equations, fitted parameters renamed as predictions, self-citations, uniqueness theorems, or ansatzes appear in the provided text. The result is presented as an observed grader score rather than a derived quantity that reduces to its inputs by construction. The verification limitation noted in the skeptic attack concerns coverage rather than circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical free parameters, axioms, or invented entities are introduced; the description is at the level of system architecture and one empirical outcome.

pith-pipeline@v0.9.1-grok · 5816 in / 961 out tokens · 33667 ms · 2026-06-30T16:42:27.930807+00:00 · methodology

Original abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

Figures

Figures reproduced from arXiv: 2605.22794 by Huajiang Zheng, Jun Song, Qianshu Cai, Wei Xue, Xianzhang Jia, Xinmei Tian, Yike Guo, Yonggang Zhang.

**Figure 2.** The four nested levels of MOSS evolution: a pre-loop baseline (Layer 0), an iteration loop

**Figure 3.** Iteration-1 trace covering all stages MOSS executed; stage 3 (Plan-Review) and stage 5

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TTHE: Test-Time Harness Evolution
cs.SE 2026-07 conditional novelty 6.0

An LLM agent can improve itself at test time by rewriting its surrounding executable harness from unlabeled traces, using only proxy signals and a frozen model.

