Pith · machine review for the scientific record

arxiv: 2604.17406 · v2 · submitted 2026-04-19 · 💻 cs.AI

Recognition: unknown

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords evolving agents · agentic science · self-evolution · scientific discovery · autonomous agents · hypothesis refinement · continuous learning · agent frameworks

The pith

EvoMaster is a foundational framework that lets scientific agents continuously evolve through self-critique and knowledge accumulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing agent frameworks fall short for scientific discovery because they are static and do not learn from experience. EvoMaster addresses this by implementing a continuous self-evolution process in which agents refine hypotheses, critique their own outputs, and build knowledge over repeated trials. This setup is meant to closely follow the iterative nature of human science. If effective, it would make it straightforward to create specialized, improving agents for any scientific area without starting from scratch each time. The framework serves as a base that developers can extend with minimal code to support broad applications in fields from physics to machine learning.

Core claim

EvoMaster is presented as a foundational evolving agent framework specifically engineered for Agentic Science at Scale. It operates on the principle of continuous self-evolution, enabling agents to iteratively refine hypotheses, perform self-critique, and accumulate knowledge across experimental cycles in a way that mirrors human scientific inquiry. As a domain-agnostic base harness, it facilitates the easy scaling of self-evolving scientific agents for arbitrary disciplines. The authors developed the SciMaster ecosystem upon this framework across domains such as machine learning, physics, and general science to demonstrate its utility.

What carries the argument

The continuous self-evolution process, which drives iterative hypothesis refinement, self-critique, and knowledge accumulation in the agents.
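The loop described above can be sketched as a toy Python class. This is an illustrative stand-in, not EvoMaster's actual API: the class name, methods, and string-based stubs for the LLM calls are all hypothetical, chosen only to make the refine-critique-accumulate cycle concrete.

```python
from dataclasses import dataclass, field

@dataclass
class EvolvingAgent:
    # Knowledge persists across cycles; this is the "accumulation" part.
    knowledge: list = field(default_factory=list)

    def propose(self, problem: str) -> str:
        # Stand-in for an LLM call that drafts a hypothesis,
        # conditioned on everything learned so far.
        return f"hypothesis for {problem!r} given {len(self.knowledge)} notes"

    def critique(self, hypothesis: str) -> str:
        # Stand-in for an LLM call that critiques the current draft.
        return f"critique of: {hypothesis}"

    def evolve(self, problem: str, cycles: int = 3) -> str:
        hypothesis = self.propose(problem)
        for _ in range(cycles):
            feedback = self.critique(hypothesis)
            self.knowledge.append(feedback)     # accumulate across cycles
            hypothesis = self.propose(problem)  # refinement sees the new notes
        return hypothesis

agent = EvolvingAgent()
final = agent.evolve("toy problem", cycles=3)
print(len(agent.knowledge))  # 3 critique notes accumulated
```

The design choice worth noting is that `knowledge` outlives any single trial, which is what distinguishes this pattern from a static agent that restarts from the same prompt each run.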

Load-bearing premise

The process of continuous self-evolution actually leads to meaningful gains in scientific reasoning and discovery ability instead of merely tuning to particular benchmarks.

What would settle it

Observing whether agents using EvoMaster show sustained performance gains on entirely new scientific problems after several rounds of self-evolution, as opposed to plateauing or matching the results of non-evolving agent versions.

Original abstract

The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narrowly scoped, and lack the capacity to learn from trial and error. To bridge this gap, we present EvoMaster, a foundational evolving agent framework engineered specifically for Agentic Science at Scale. Driven by the core principle of continuous self-evolution, EvoMaster empowers agents to iteratively refine hypotheses, self-critique, and progressively accumulate knowledge across experimental cycles, faithfully mirroring human scientific inquiry. Crucially, as a domain-agnostic base harness, EvoMaster is exceptionally easy to scale up -- enabling developers to build and deploy highly capable, self-evolving scientific agents for arbitrary disciplines in approximately 100 lines of code. Built upon EvoMaster, we incubated the SciMaster ecosystem across domains such as machine learning, physics, and general science. Evaluations on four authoritative benchmarks (Humanity's Last Exam, MLE-Bench Lite, BrowseComp, and FrontierScience) demonstrate that EvoMaster achieves state-of-the-art scores of 41.1%, 75.8%, 73.3%, and 53.3%, respectively. It comprehensively outperforms the general-purpose baseline OpenClaw with relative improvements ranging from +159% to +316%, robustly validating its efficacy and generality as the premier foundational framework for the next generation of autonomous scientific discovery. EvoMaster is available at https://github.com/sjtu-sai-agents/EvoMaster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EvoMaster, a domain-agnostic evolving agent framework for Agentic Science that enables continuous self-evolution via iterative hypothesis refinement, self-critique, and knowledge accumulation. It claims to be scalable (deployable in ~100 lines of code) and demonstrates SOTA results on four benchmarks: 41.1% on Humanity's Last Exam, 75.8% on MLE-Bench Lite, 73.3% on BrowseComp, and 53.3% on FrontierScience, with relative gains of +159% to +316% over the OpenClaw baseline. The work also describes the SciMaster ecosystem built on EvoMaster across ML, physics, and general science, with code released on GitHub.

Significance. If the performance claims and causal attribution to self-evolution hold after proper controls, this could provide a useful foundational harness for building self-improving scientific agents at scale, with the open-source release aiding reproducibility. The emphasis on mirroring the iterative scientific method is conceptually aligned with needs in agentic AI, though the current evidence base does not yet allow assessment of whether the gains reflect genuine reasoning improvements or other factors.

major comments (2)
  1. [Abstract] The headline benchmark scores and relative improvements over OpenClaw are stated without any description of experimental protocols, number of trials, statistical tests, baseline re-implementations, compute budgets, or controls for trajectory length and LLM call count. This absence leaves the central performance claims without verifiable support.
  2. [Abstract] No ablation or controlled comparison is reported to isolate the contribution of the continuous self-evolution loop (hypothesis refinement and self-critique) from confounding factors such as additional inference steps, domain-specific prompting, or simply running a static agent for more iterations. Without such evidence the causal link between the framework's core principle and the reported gains remains untested.
minor comments (1)
  1. [Abstract] The claim that EvoMaster is 'exceptionally easy to scale up' in approximately 100 lines of code would benefit from a brief concrete code snippet or pseudocode example in the main text to illustrate the base harness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing experimental rigor and the need to strengthen causal claims. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The headline benchmark scores and relative improvements over OpenClaw are stated without any description of experimental protocols, number of trials, statistical tests, baseline re-implementations, compute budgets, or controls for trajectory length and LLM call count. This absence leaves the central performance claims without verifiable support.

    Authors: We agree the abstract is too concise to include these details. The full manuscript describes the evaluation protocols, multiple trials, statistical reporting, baseline re-implementations, and compute budgets in the Experimental Setup section, with explicit controls for trajectory length and LLM call counts to ensure fair comparisons. We will revise the abstract to include a brief summary of the methodology and add a dedicated paragraph on controls in the main text for improved verifiability. revision: yes

  2. Referee: [Abstract] No ablation or controlled comparison is reported to isolate the contribution of the continuous self-evolution loop (hypothesis refinement and self-critique) from confounding factors such as additional inference steps, domain-specific prompting, or simply running a static agent for more iterations. Without such evidence the causal link between the framework's core principle and the reported gains remains untested.

    Authors: We acknowledge that dedicated ablations would more rigorously isolate the self-evolution components. The manuscript currently demonstrates gains via comparison to the static OpenClaw baseline under matched conditions. To address potential confounders, we will add controlled ablation experiments in the revision, comparing the full framework against variants lacking hypothesis refinement or self-critique while holding iteration count, LLM calls, and prompting constant. This will provide direct evidence for the contribution of the evolving loop. revision: yes
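The matched-budget comparison the authors promise can be sketched in a few lines. Everything here is a toy stand-in (the `solve` function, the random "draft quality" scores, the budget of 10 calls are all hypothetical), but it shows the control the referee asks for: the full loop and the no-critique variant spend exactly the same number of model calls, so a score gap cannot be explained by extra inference steps.

```python
import random

def solve(problem: str, use_critique: bool, budget: int, rng: random.Random):
    """Toy agent run under a fixed LLM-call budget."""
    score, calls = 0.0, 0
    while calls < budget:
        draft = rng.random()                  # stand-in for one proposal call
        calls += 1
        if use_critique and calls < budget:
            draft = max(draft, rng.random())  # critique + revise costs a call
            calls += 1
        score = max(score, draft)
    return score, calls

rng = random.Random(0)
full, calls_full = solve("task", use_critique=True, budget=10, rng=rng)
ablated, calls_abl = solve("task", use_critique=False, budget=10, rng=rng)
assert calls_full == calls_abl == 10  # compute budget held constant
```

Holding `calls` constant across arms is the key design choice; iteration count, prompting, and model would be pinned the same way in a real ablation.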

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper describes an empirical agent framework and reports direct benchmark scores without any mathematical derivation chain, equations, or fitted parameters. No self-definitional reductions, predictions derived from inputs by construction, or load-bearing self-citations appear in the abstract or described content. The performance claims rest on external benchmark evaluations against baselines, which are independent measurements rather than tautological outputs. This is a standard non-circular empirical presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests primarily on the domain assumption that LLM agents can reliably self-critique and improve through iterative cycles, with the framework itself serving as the main invented component.

axioms (1)
  • domain assumption: LLM-based agents can effectively perform self-critique and iterative hypothesis refinement in scientific tasks.
    This is the core principle invoked to justify the self-evolution capability.
invented entities (1)
  • EvoMaster evolving agent framework (no independent evidence)
    purpose: To provide a scalable, domain-agnostic base for self-evolving scientific agents.
    The framework is introduced as a new harness without external independent validation beyond the reported benchmarks.

pith-pipeline@v0.9.0 · 5666 in / 1257 out tokens · 45905 ms · 2026-05-10T05:55:16.555100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1] J. Chai, S. Tang, R. Ye, et al. SciMaster: Towards general-purpose scientific AI agents, Part I. X-Master as foundation: Can we lead on Humanity's Last Exam? arXiv preprint arXiv:2507.05241.

  2. [2] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patil, et al. MLE-Bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095.

  3. [3] L. Gao et al. Accelerating scientific discovery with AI agents: A community perspective. arXiv preprint arXiv:2501.04227.

  4. [4] J. Gottweis et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864.

  5. [5] Q. Huang, J. Vora, P. Liang, and J. Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023.

  6. [6] LangChain. LangChain: Build context-aware reasoning applications. https://www.langchain.com/, 2025a. LangChain. LangGraph: Build stateful, multi-actor applications with LLMs. https://langchain-ai.github.io/langgraph/, 2025b. Z. Lei, G. Liu, et al. EmboCoach-Bench: Benchmarking AI agents on developing embodied robots. arXiv preprint arXiv:2601.21570.

  7. [7] Z. Liu, Y. Cai, X. Zhu, et al. ML-Master: Towards AI-for-AI via integration of exploration and reasoning. arXiv preprint arXiv:2506.16499.

  8. [8] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.

  9. [9] T. Miao, J. Dai, et al. PhysMaster: Building an autonomous AI physicist for theoretical and computational physics research. arXiv preprint arXiv:2512.19799.

  10. [10] J. Nam, J. Yoon, J. Chen, J. Shin, S. Ö. Arık, and T. Pfister. MLE-STAR: Machine learning engineering agent via search and targeted refinement. arXiv preprint arXiv:2506.15692.

  11. [11] X. Pang, S. Tang, R. Ye, et al. BrowseMaster: Towards scalable web browsing via tool-augmented programmatic agent pair. arXiv preprint arXiv:2508.09129.

  12. [12] Humanity's Last Exam. arXiv preprint arXiv:2501.14249. M. D. Skarlinski, S. Cox, J. M. Laurent, et al. Language agents achieve superhuman synthesis of scientific knowledge. arXiv preprint arXiv:2409.13740. DOI:10.1038/s41586-025-09962-4.

  13. [13] K. Swanson, D. Wu, et al. Virtual Lab: AI agents design new nanobody binders for SARS-CoV-2. arXiv preprint arXiv:2407.16928.

  14. [14] The Royal Swedish Academy of Sciences. The Nobel Prize in Chemistry 2024: Computational protein design and protein structure prediction. https://www.nobelprize.org/prizes/chemistry/2024/summary/, 2024a. The Royal Swedish Academy of Sciences. The Nobel Prize in Physics 2024: Machine learning with artificial neural networks. https://www.nobelprize.org/priz..., 2024b.

  15. [15] Y. Yamada, C. Lu, C. Lu, et al. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066.

  16. [16] X. Yang, X. Yang, S. Fang, Y. Zhang, J. Wang, B. Xian, Q. Li, J. Li, M. Xu, Y. Li, et al. R&D-Agent: An LLM-agent framework towards autonomous data science. arXiv preprint arXiv:2505.14738.

  17. [17] L. Zhang, S. Chen, Y. Cai, J. Chai, J. Chang, K. Chen, Z. X. Chen, Z. Ding, Y. Du, Y. Gao, et al. Bohrium + SciMaster: Building the infrastructure and ecosystem for agentic science at scale. arXiv preprint arXiv:2512.20469.

  18. [18] X. Zhu, Y. Cai, Z. Liu, et al. Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering. arXiv preprint arXiv:2601.10402.