Claw AI Lab: An Autonomous Multi-Agent Research Team

Cheng Chen; Deheng Ye; Deyi Ji; Dingcheng Gao; Fan Wu; Fayao Liu; Guosheng Lin; Lanyun Zhu; Qi Zhu; Taiyu Zhang

arxiv: 2605.22662 · v1 · pith:ENGNXGXLnew · submitted 2026-05-21 · 💻 cs.AI

Full author list: Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu

Pith reviewed 2026-05-22 05:30 UTC · model grok-4.3

classification 💻 cs.AI

keywords: autonomous research · multi-agent systems · AI laboratory · code integration harness · automated experimentation · research team simulation · interactive AI workflows

The pith

Claw AI Lab lets users launch a fully customizable multi-agent research team from a single prompt, with live monitoring and code integration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Claw AI Lab as a platform that shifts automated AI research from fixed single-agent pipelines to an interactive laboratory where one prompt creates a team with assigned roles, collaborative workflows, and real-time controls. It adds the Claw-Code Harness to link local codebases, datasets, and checkpoints directly to experiments while routing results back into the loop. In an internal evaluation on five case studies, expert judges preferred its outputs over those of the AutoResearchClaw baseline on novelty, completeness, and presentation quality. A sympathetic reader would see this as making autonomous research more steerable and less prone to incomplete or unfaithful results.

Core claim

By instantiating complete research teams with customizable roles and workflows plus a code harness that connects local resources to runnable experiments and feeds artifacts back, Claw AI Lab produces higher-quality research artifacts than single-agent baselines in internal judgments.

What carries the argument

The Claw AI Lab platform, which instantiates a full multi-agent team from one prompt together with the Claw-Code Harness that links codebases and returns execution artifacts into the research cycle.

If this is right

Researchers gain modes for exploration, multi-agent discussion, and reproduction with rollback and resume controls.
Experiments become easier to inspect and iterate because artifacts flow back into the system rather than remaining isolated.
Common failures such as partial runs and malformed result reporting are reduced through tighter code-to-paper integration.
The system supports distinct research modes that make the overall process more laboratory-like and controllable.
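
To make the rollback and resume controls concrete, here is a minimal Python sketch of a staged run that snapshots its artifacts after each stage. Everything in it (the `LabRun` class, the stage names, the artifact dictionary) is a hypothetical illustration, not the Claw AI Lab API, which the paper summary does not specify.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage function: receives the artifacts so far, returns new ones.
Stage = Callable[[dict], dict]


@dataclass
class LabRun:
    """Toy staged run with per-stage snapshots (illustrative only)."""

    stages: list[tuple[str, Stage]]
    artifacts: dict = field(default_factory=dict)
    # snapshots[i] holds the artifacts as they were after stage i finished.
    snapshots: list[dict] = field(default_factory=list)

    def resume(self) -> dict:
        """Run every stage that has not completed yet."""
        for _name, stage in self.stages[len(self.snapshots):]:
            self.artifacts.update(stage(self.artifacts))
            self.snapshots.append(copy.deepcopy(self.artifacts))
        return self.artifacts

    def rollback(self, completed: int) -> None:
        """Discard all work after the first `completed` stages."""
        if not 0 <= completed <= len(self.snapshots):
            raise ValueError("cannot roll back past recorded snapshots")
        self.snapshots = self.snapshots[:completed]
        self.artifacts = copy.deepcopy(self.snapshots[-1]) if self.snapshots else {}


# Invented three-stage pipeline used only to exercise the class.
run = LabRun(stages=[
    ("idea", lambda a: {"idea": "toy hypothesis"}),
    ("experiment", lambda a: {"result": len(a["idea"])}),
    ("writing", lambda a: {"paper": f"result={a['result']}"}),
])
run.resume()      # runs all three stages
run.rollback(1)   # keep only the artifacts of the "idea" stage
run.resume()      # re-runs "experiment" and "writing"
```

Snapshotting full artifact copies is the simplest design that makes rollback trivially correct; a real system would more plausibly store references to files on disk.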

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to non-AI scientific domains if the harness is generalized beyond code execution to handle lab instruments or simulation engines.
Teams built this way might allow human researchers to intervene at any step without restarting the entire process, changing how oversight is applied in automated workflows.
If the harness pattern proves robust, similar integration layers could be added to other agent frameworks to improve reproducibility across projects.

Load-bearing premise

That preference ratings from a small internal group of expert judges on five unspecified case studies reliably measure better research novelty, completeness, and quality.

What would settle it

An external evaluation, using a larger and independent set of judges on a broader collection of research tasks, in which Claw AI Lab is not consistently preferred or scores lower on the same metrics.
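
One way to see why five case studies is a thin base: assume (our framing, not the paper's) that each case study yields a single pairwise preference and that a judge with no real preference picks either system with probability 0.5. An exact two-sided sign test can then be computed directly. The function name and the one-judgment-per-case setup are illustrative.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided sign test p-value under H0: P(win) = 0.5."""
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    k = max(wins, n - wins)  # count on the more extreme side
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed scenario: the new system is preferred in all 5 of 5 comparisons.
print(sign_test_p_value(5, 5))    # 0.0625
# The same unanimity over 20 comparisons is far stronger evidence.
print(sign_test_p_value(20, 20))  # about 1.9e-06
```

Under these assumptions, even a unanimous 5-of-5 preference does not clear the conventional 0.05 threshold, which is why a broader task set matters more than a stronger margin.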

Figures

Figures reproduced from arXiv: 2605.22662 by Cheng Chen, Deheng Ye, Deyi Ji, Dingcheng Gao, Fan Wu, Fayao Liu, Guosheng Lin, Lanyun Zhu, Qi Zhu, Taiyu Zhang, Tianrun Chen, Xinzhen Xu, Yanyu Qian, Yi Tan, Zhenshan Tan.

**Figure 2.** Detailed comparison for four paper pairs scored by Gemini and ChatGPT, respectively.

Original abstract

We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi-agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory-like in practice. A key practical contribution of Claw AI Lab lies in its Claw-Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability-aware scientific infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Claw AI Lab describes a multi-agent platform with a dashboard and local code harness, but its main claim of consistent preference over the baseline rests on five internal case studies with no disclosed judge details or controls.

The paper presents Claw AI Lab as a platform that lets users spin up a full research team from a prompt, with customizable roles, real-time monitoring, rollback through a dashboard, and separate modes for exploration or reproduction. The Claw-Code Harness is the part that actually connects to local codebases, datasets, and checkpoints so experiments can run and feed artifacts back into the loop. That integration is the clearest practical addition over earlier single-agent or fixed-workflow setups, and it could reduce some of the usual problems with partial runs or sloppy result reporting in generated papers. The architecture description is straightforward and shows they thought about steerability and inspection in a lab setting rather than just end-to-end generation. What stands out is the attempt to make autonomous research feel more like an interactive system instead of a black-box pipeline.

The evaluation is the weak point. The paper states that on five AI research case studies, AI expert judges consistently preferred Claw AI Lab over AutoResearchClaw for novelty, completeness, and presentation quality. No information appears on how many judges were involved, how they were selected, whether the comparison was blinded, what rubrics were used, or any measure of agreement. The case studies themselves are not described. This leaves the preference claim without enough grounding to treat it as solid evidence of improvement.

A reader working on automated research tools might still pick up useful ideas from the dashboard and harness design. The paper is not trying to solve a core open question in the field; it is offering infrastructure that could be built on. It is coherent enough on its own terms to warrant peer review, mainly so referees can push for clearer evaluation or objective metrics instead of the current internal judgment. I would flag the evaluation section for revision but would not desk-reject on that basis alone.

Referee Report

2 major / 1 minor

Summary. The paper presents Claw AI Lab, a multi-agent autonomous research platform allowing instantiation of customizable research teams from a single prompt, with collaborative workflows, real-time dashboard monitoring, artifact inspection, rollback controls, and distinct modes for exploration, discussion, and reproduction. A core component is the Claw-Code Harness for integrating local codebases, datasets, and checkpoints into runnable experiments with feedback into the research loop. The central claim is that this makes autonomous research more steerable and laboratory-like, evidenced by an internal evaluation on five AI research case studies where Claw AI Lab was consistently preferred over the AutoResearchClaw baseline by AI expert judges on idea novelty, experiment completeness, and paper presentation quality.

Significance. If the evaluation results hold under more rigorous scrutiny, the work could contribute to practical advances in automated research systems by addressing execution integrity and iteration challenges through integrated harnesses and interactive team structures, potentially improving reproducibility in AI-driven discovery pipelines.

major comments (2)

[Evaluation section (internal case studies)] The load-bearing empirical support in the evaluation section consists of preference judgments on five unspecified AI research case studies. No details are provided on case study selection criteria, the number or expertise of the AI expert judges, blinding procedures, evaluation rubrics, inter-rater agreement statistics, or any quantitative controls, which leaves the claim of consistent preference without verifiable grounding.
[Claw-Code Harness description] The manuscript asserts that the Claw-Code Harness improves experimental completion and result integrity over prior approaches, but provides no quantitative metrics (such as completion rates, error frequencies, or statistical comparisons) to support this; the description remains qualitative despite being central to the platform's practical contribution.
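
The second major comment asks for metrics such as completion rates and error frequencies. A minimal sketch of how these could be tallied from run logs follows. The `RunRecord` fields, the system names, and the sample records are hypothetical placeholders, not data from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One experiment run (hypothetical schema for illustration)."""

    system: str           # which system launched the run
    completed: bool       # did the run finish all planned steps?
    results_parsed: bool  # could reported numbers be parsed from the artifacts?


def summarize(records: list[RunRecord], system: str) -> dict[str, float]:
    """Completion rate and malformed-result rate for one system."""
    runs = [r for r in records if r.system == system]
    if not runs:
        raise ValueError(f"no runs recorded for {system!r}")
    done = [r for r in runs if r.completed]
    malformed = [r for r in done if not r.results_parsed]
    return {
        "runs": len(runs),
        "completion_rate": len(done) / len(runs),
        # Malformed reporting is only measurable on runs that finished.
        "malformed_rate": len(malformed) / len(done) if done else 0.0,
    }


# Placeholder records, invented purely to exercise the function.
records = [
    RunRecord("system_a", True, True),
    RunRecord("system_a", True, False),
    RunRecord("system_a", False, False),
    RunRecord("system_b", True, True),
]
print(summarize(records, "system_a"))
```

Reporting the malformed rate over completed runs only keeps the two failure modes named in the abstract (partial runs and malformed result reporting) from being double counted.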

minor comments (1)

[Abstract and evaluation] The abstract and evaluation description refer to 'five AI research case studies' without even brief high-level descriptors of their topics or domains, which would help readers assess the scope of the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where additional rigor would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Evaluation section (internal case studies)] The load-bearing empirical support in the evaluation section consists of preference judgments on five unspecified AI research case studies. No details are provided on case study selection criteria, the number or expertise of the AI expert judges, blinding procedures, evaluation rubrics, inter-rater agreement statistics, or any quantitative controls, which leaves the claim of consistent preference without verifiable grounding.

Authors: We agree that the evaluation section requires substantially more methodological detail to support the reported preferences. In the revised manuscript we will expand this section to specify the criteria used to select the five AI research case studies, the number and expertise of the AI expert judges (including their relevant publication records and experience), the blinding procedures implemented, the exact evaluation rubrics supplied to judges, inter-rater agreement statistics (e.g., Cohen’s kappa or Fleiss’ kappa), and any quantitative controls or statistical tests performed. These additions will provide the verifiable grounding requested. revision: yes
Referee: [Claw-Code Harness description] The manuscript asserts that the Claw-Code Harness improves experimental completion and result integrity over prior approaches, but provides no quantitative metrics (such as completion rates, error frequencies, or statistical comparisons) to support this; the description remains qualitative despite being central to the platform's practical contribution.

Authors: We acknowledge that the current description of the Claw-Code Harness is qualitative and that quantitative evidence would better substantiate its claimed benefits. We will revise the relevant sections to include quantitative metrics drawn from our internal testing, such as experiment completion rates, observed error frequencies, and direct comparisons against the baseline where available. Should additional controlled measurements be needed, we will conduct them and report the results in the updated manuscript. revision: yes
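
The first response names Cohen's kappa as a candidate agreement statistic. For reference, a minimal sketch for two judges labeling the same items follows. The label strings and sample ratings are invented for illustration, and Fleiss' kappa would be needed for more than two judges.

```python
from __future__ import annotations

from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical label
        # throughout; returning 1.0 here is a convention, not a standard.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented ratings: which system each judge preferred on four items.
judge_1 = ["new", "new", "base", "new"]
judge_2 = ["new", "base", "base", "new"]
print(cohens_kappa(judge_1, judge_2))  # 0.5
```

In this toy case the judges agree on three of four items (0.75 observed agreement), but half of that is expected by chance given their label frequencies, so kappa is 0.5.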

Circularity Check

0 steps flagged

No significant circularity; claims rest on descriptive system design and independent internal evaluation

The paper describes an autonomous multi-agent research platform and its features (team instantiation, dashboard, Claw-Code Harness) without any mathematical derivation chain, equations, fitted parameters, or predictions. The central empirical claim is an internal preference judgment over a named external baseline (AutoResearchClaw) on five case studies; this is presented as a direct test result rather than a quantity derived from or equivalent to the system's own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The evaluation may have methodological limitations, but these do not constitute circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The platform description relies on standard assumptions about multi-agent collaboration and code execution reliability without introducing new mathematical axioms or fitted parameters; no invented physical or theoretical entities are postulated.

axioms (1)

domain assumption: Multi-agent systems with customizable roles can produce higher-quality research outputs than single-agent baselines when given appropriate collaboration workflows.
Invoked implicitly when claiming consistent preference over AutoResearchClaw in the internal evaluation.

pith-pipeline@v0.9.0 · 5815 in / 1428 out tokens · 34613 ms · 2026-05-22T05:30:48.417631+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "hierarchical multi-agent framework that automates the end-to-end research process by decomposing it into five structured layers: Idea, Planning, Coding, Experiment, and Writing"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Claw-Code Harness ... improves not only execution integration, but also experimental completion and result integrity"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

[1] Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms Towards
[2] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A Fast Learning Algorithm for Deep Belief Nets
[3] Deep learning. 2016
[4] Jiaqi Liu, Peng Xia, Siwei Han, Shi Qiu, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Jiawei Zhou, Hongtu Zhu, Yun Li, Jiaheng Zhang, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. 2026
[5] Andrej Karpathy. 2026
[6] 2026
[9] Agent Laboratory: Using LLM Agents as Research Assistants. Findings of the Association for Computational Linguistics: EMNLP 2025
[11] Accelerating Scientific Breakthroughs with an AI Co-Scientist. 2025
[13] DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
[14] PaperBench: Evaluating AI's Ability to Replicate AI Research. Proceedings of the 42nd International Conference on Machine Learning
[17] aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists. arXiv preprint arXiv:2508.15126
[19] Black Forest Labs. 2024
[20] Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li, Bofei Liu, Zhengyang Liu, Guancheng Li, Qingfu Zhu, and Wanxiang Che. EpiBench: Benchmarking multi-turn research workflows for multimodal agents. arXiv preprint arXiv:2604.05557, 2026
[21] Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv preprint arXiv:2505.13400, 2025
[22] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025
[23] Andrej Karpathy. autoresearch, 2026. URL https://github.com/karpathy/autoresearch
[24] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024
[25] Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li, Dirk Bergemann, and Zhuoran Yang. Build your personalized research group: A multiagent framework for continual and interactive science automation. arXiv preprint arXiv:2510.15624, 2025
[26] Jiaqi Liu, Peng Xia, Siwei Han, Shi Qiu, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Jiawei Zhou, Hongtu Zhu, Yun Li, Jiaheng Zhang, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. AutoResearchClaw: Fully autonomous research from idea to paper, 2026. URL https://github.com/aiming-lab/AutoResearchClaw
[27] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024
[28] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 5977-6043, 2025
[29] Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI's ability to replicate AI research. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceed...
[30] UltraWorkers. Claw Code. https://github.com/ultraworkers/claw-code, 2026. Public Rust implementation of the claw CLI agent harness. Accessed: 2026-05-18
[31] Fan Wu, Cheng Chen, Zhoujie Fu, Jiacheng Wei, Yi Xu, Deheng Ye, and Guosheng Lin. PhyCustom: Towards realistic physical customization in text-to-image generation. arXiv preprint arXiv:2512.02794, 2025
[32] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025
[33] Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 414-431, 2025
