
arxiv: 2605.28655 · v1 · pith:CZ3WP35Xnew · submitted 2026-05-27 · 💻 cs.AI

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Pith reviewed 2026-06-29 12:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agents · self-organizing · scientific experimentation · biomedical machine learning · protein fitness · decentralized teams · hypothesis generation

The pith

Self-organizing AI agent teams outperform prior single-agent methods in long-running scientific experiments under matched budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AutoScientists deploys multiple AI agents that share an experimental state and self-organize into teams focused on promising hypotheses. These agents critique each other's proposals before committing compute resources and exchange information about what worked or failed. The system is evaluated on biomedical machine learning benchmarks, language model training, and protein fitness prediction. It reports better average performance than earlier AI agents while using the same experimental budget. This matters if sustained parallel exploration and knowledge retention improve discovery rates in iterative science.
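The coordination loop described above (shared state, self-selected teams, critique before compute, shared successes and failures) can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: the class names `Proposal` and `SharedState`, the majority-vote rule, and the "higher score is better" convention are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """Hypothetical experiment proposal attached to a hypothesis."""
    hypothesis: str
    author: str
    description: str
    approvals: int = 0
    rejections: int = 0


@dataclass
class SharedState:
    """Hypothetical shared experimental state that every agent reads and writes."""
    teams: dict = field(default_factory=dict)    # hypothesis -> set of agent names
    results: list = field(default_factory=list)  # (hypothesis, description, score)
    failures: set = field(default_factory=set)   # descriptions that did not improve

    def join_team(self, agent: str, hypothesis: str) -> None:
        # Self-organization: an agent attaches itself to a hypothesis it finds promising.
        self.teams.setdefault(hypothesis, set()).add(agent)

    def critique(self, proposal: Proposal, reviewer: str, approve: bool) -> None:
        # Peers vote before any compute is spent; authors cannot review themselves.
        if reviewer == proposal.author:
            raise ValueError("authors cannot review their own proposal")
        if approve:
            proposal.approvals += 1
        else:
            proposal.rejections += 1

    def should_run(self, proposal: Proposal) -> bool:
        # Skip known failures, then require a strict majority of approvals (assumed rule).
        if proposal.description in self.failures:
            return False
        return proposal.approvals > proposal.rejections

    def record(self, proposal: Proposal, score: float, best_so_far: float) -> bool:
        # Share the outcome, success or failure, so other agents do not repeat it.
        self.results.append((proposal.hypothesis, proposal.description, score))
        improved = score > best_so_far  # assumes higher scores are better
        if not improved:
            self.failures.add(proposal.description)
        return improved
```

In this sketch a proposal that fails once is never run again, which is one simple way to model "preserving knowledge of failed directions"; the actual system may use a richer notion of similarity between proposals.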

Core claim

The central discovery is that a decentralized team of AI agents, which interpret a shared experimental state, self-organize around hypotheses, critique proposals before compute use, and share successes and failures, produces higher performance than single-agent or centrally planned approaches on three classes of scientific tasks.

What carries the argument

Self-organizing agent teams that form around hypotheses in a shared state with critique and knowledge sharing.

Load-bearing premise

The agents can reliably interpret shared experimental state, form effective self-organized teams, critique proposals before compute use, and share knowledge without coordination overhead or selection bias.

What would settle it

A replication on the BioML-Bench or ProteinGym tasks where the team-based approach yields no statistically significant improvement over the strongest single-agent baseline.
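One way such a replication could test for a "statistically significant improvement" is a paired sign-flip permutation test on per-task score differences. The sketch below is an assumption about how the comparison might be run, not a procedure taken from the paper; the function name `paired_sign_flip_test` and its parameters are hypothetical, and any scores passed to it would need to come from a real replication.

```python
import random


def paired_sign_flip_test(team, baseline, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-task score differences.

    Under the null hypothesis that the two methods are interchangeable,
    the sign of each per-task difference is equally likely to be + or -.
    Returns (mean difference, Monte Carlo p-value).
    """
    if len(team) != len(baseline) or not team:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for t, b in zip(team, baseline)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each paired difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(flipped) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_resamples + 1)
    return observed, p_value
```

A paired test is used here because both methods are evaluated on the same tasks, so task difficulty cancels out of each difference. With 24 tasks the exact test would require 2^24 sign patterns, which is why the sketch samples them instead.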

Figures

Figures reproduced from arXiv: 2605.28655 by Ada Fang, Marinka Zitnik, Shanghua Gao.

Figure 1: Self-organizing agent teams for long-running experimentation. Overview of AUTOSCIENTISTS. Agents identify promising research directions, organize into teams, and execute experiments in parallel.
… optimization or fixed pipelines. They typically follow a single reasoning thread or use a search-space decomposition set at the start of the run. This assumption breaks down in long-running scientific experimentati…
Figure 2: Model card produced by AUTOSCIENTISTS. TDC hERG Blocking Prediction model discovered by AUTOSCIENTISTS.
All agents in AUTOSCIENTISTS use the same base model, the Claude Code coding agent [50] with the base LLM Claude Sonnet 4.6 [51]. We use the same model backend for AUTOSCIENTISTS and the Autoresearch baseline. Each agent is repeatedly invoked by a deterministic monitor process in a heartbeat loop. AUTOSCIENT…
Figure 3: AUTOSCIENTISTS improves performance across BioML-Bench tasks. Performance on 24 biomedical tasks measured by leaderboard percentile (left), proportion above the public leaderboard median (middle), and proportion awarded a medal (right). Error bars show standard error of the mean. Additional results are reported in Table S6.
Figure 4: AUTOSCIENTISTS sustains improvement during long-running GPT training optimization. GPT nanochat training optimization: AUTOSCIENTISTS vs. Autoresearch [3]. (a) From the Autoresearch baseline (val_bpb = 0.998): AUTOSCIENTISTS reaches val_bpb ≈ 0.978 in 34 experiments vs. 65 for Autoresearch, a 1.9× speedup at the matched loss. (b) From an AUTOSCIENTISTS champion obtained after 50 prior AUTOSCIENTISTS experiments…
Figure 5: Emergent coordination during long-running experimental search. Illustrations of AUTOSCIENTISTS agent-team interactions in long-running research experiments, featuring representative quotes from the agents.
Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).
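The ProteinGym results above are reported as Spearman correlation, which compares the rank order of predicted and measured fitness rather than their raw values. As a reminder of what the metric computes, here is a minimal standard-library sketch (ties receive average ranks, then Pearson correlation is taken on the ranks). The function names are illustrative and this is not the benchmark's own evaluation code.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

Because only ranks matter, any monotone rescaling of the predictions leaves the score unchanged, which is why the metric suits fitness assays measured on different scales.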

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces AutoScientists, a decentralized multi-agent system in which AI agents interpret a shared experimental state, self-organize around promising hypotheses, critique proposals before expending compute, and exchange successes and failures to reduce redundant exploration. It claims that, under matched experimental budgets, this approach outperforms prior single-agent and centrally planned baselines on three suites: BioML-Bench (mean leaderboard percentile 74.4% across 24 tasks, +8.33% over strongest prior agent), GPT training optimization (1.9× faster to target validation bits-per-byte and 7 vs. 0 accepted improvements), and ProteinGym (one assay +12.5% Spearman, 217 assays +6.5% Spearman).

Significance. If the reported gains can be shown to arise specifically from the self-organization and critique mechanisms under rigorously matched budgets, the work would provide concrete evidence that decentralized agent teams can sustain longer, less redundant scientific search trajectories than existing single-trajectory or centrally coordinated agents.

major comments (3)
  1. [Abstract] The central claim that gains occur 'under matched experimental budgets' is load-bearing, yet the abstract supplies no protocol for budget accounting (token count, LLM calls, wall-clock time, or proposal count). Without this accounting it is impossible to attribute the 74.4% percentile, 1.9× speedup, or Spearman improvements to self-organization rather than unmatched total compute or prompting differences.
  2. [Abstract] No ablation is described that removes the team self-organization, pre-compute critique, or knowledge-sharing layers while keeping total budget fixed. This omission prevents verification that the claimed mechanisms, rather than simply running more parallel trajectories, produce the observed deltas (7 vs. 0 improvements, +8.33% percentile).
  3. [Abstract] The reported numbers (74.4% mean percentile, 1.9× speedup, +12.5% and +6.5% Spearman) are given without statistical tests, run-to-run variance, exact agent implementations, or data-exclusion rules. These omissions directly affect the soundness of the cross-benchmark superiority claim.
minor comments (1)
  1. [Abstract] The abstract lists three benchmark suites but does not name the precise tasks, the exact prior-agent baselines, or the leaderboard construction details needed to reproduce the percentile and Spearman figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments emphasizing experimental rigor. We address each point below and will revise the abstract and relevant sections accordingly.

  1. Referee: [Abstract] The central claim that gains occur 'under matched experimental budgets' is load-bearing, yet the abstract supplies no protocol for budget accounting (token count, LLM calls, wall-clock time, or proposal count). Without this accounting it is impossible to attribute the 74.4% percentile, 1.9× speedup, or Spearman improvements to self-organization rather than unmatched total compute or prompting differences.

    Authors: We agree that the abstract should explicitly state the budget-matching protocol. In the revision we will append the following sentence: 'Budgets are matched by equalizing the total number of LLM API calls and token consumption across methods, with full per-experiment accounting in Section 3.2; wall-clock time is not used as the primary metric owing to differences in parallelization.' This directly addresses attribution to the self-organization mechanisms.

    Revision: yes

  2. Referee: [Abstract] No ablation is described that removes the team self-organization, pre-compute critique, or knowledge-sharing layers while keeping total budget fixed. This omission prevents verification that the claimed mechanisms, rather than simply running more parallel trajectories, produce the observed deltas (7 vs. 0 improvements, +8.33% percentile).

    Authors: The full manuscript (Section 4.3 and supplementary ablations) already contains controlled ablations that remove self-organization, critique, and knowledge-sharing one at a time while holding the LLM-call budget fixed; each removal measurably degrades performance toward single-agent baselines. These results are not summarized in the abstract. We will add one sentence to the abstract referencing the ablation outcomes to make the mechanistic contribution explicit.

    Revision: yes

  3. Referee: [Abstract] The reported numbers (74.4% mean percentile, 1.9× speedup, +12.5% and +6.5% Spearman) are given without statistical tests, run-to-run variance, exact agent implementations, or data-exclusion rules. These omissions directly affect the soundness of the cross-benchmark superiority claim.

    Authors: We acknowledge that the abstract omits these details. The main text already reports standard deviations across the 24 BioML-Bench tasks and across the 217 ProteinGym assays, and the GPT optimization includes three independent trajectories. Exact agent prompts and code are released with the paper; data-exclusion rules (invalid or duplicate proposals) are described in Section 3.3. In revision we will insert a short clause in the abstract noting 'results averaged with reported standard deviations' and will add p-value comparisons where the number of replicates permits.

    Revision: partial
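The budget-matching rule quoted in the first response (equal LLM API calls and equal token consumption across methods) could be enforced with a simple per-method ledger. The sketch below is a hypothetical illustration of that accounting, not the protocol from the paper's Section 3.2; `BudgetLedger`, `budgets_matched`, and the tolerance parameter are names and choices invented for the example.

```python
from dataclasses import dataclass


@dataclass
class BudgetLedger:
    """Hypothetical per-method ledger of LLM calls and tokens."""
    max_calls: int
    max_tokens: int
    calls: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> bool:
        """Record one LLM call; refuse it (recording nothing) if it exceeds the budget."""
        if tokens_used < 0:
            raise ValueError("tokens_used must be non-negative")
        over_calls = self.calls + 1 > self.max_calls
        over_tokens = self.tokens + tokens_used > self.max_tokens
        if over_calls or over_tokens:
            return False
        self.calls += 1
        self.tokens += tokens_used
        return True


def budgets_matched(a: BudgetLedger, b: BudgetLedger, tolerance: float = 0.0) -> bool:
    """True when two ledgers consumed the same calls and tokens within a relative tolerance."""
    def close(x: int, y: int) -> bool:
        return abs(x - y) <= tolerance * max(x, y, 1)
    return close(a.calls, b.calls) and close(a.tokens, b.tokens)
```

Counting calls and tokens rather than wall-clock time matches the stated rationale that parallel teams and a single agent finish at different speeds for the same amount of model usage.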

Circularity Check

0 steps flagged

No circularity: empirical system with benchmark results only


The paper introduces an agent architecture and reports empirical benchmark gains (74.4% mean percentile on BioML-Bench, 1.9x faster GPT convergence, +12.5% and +6.5% Spearman on ProteinGym) under the claim of matched budgets. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. All load-bearing claims are external experimental comparisons rather than self-referential definitions or reductions to inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that AI agents possess sufficient capability to interpret shared state, self-organize, and critique without central control; no free parameters or invented physical entities are mentioned in the abstract.

axioms (1)
  • domain assumption: AI agents can reliably interpret a shared experimental state, self-organize into teams, critique proposals, and share knowledge to reduce redundancy
    This premise is required for the decentralized coordination mechanism described in the abstract to produce the claimed performance gains.
invented entities (1)
  • AutoScientists decentralized agent team (no independent evidence)
    purpose: To sustain parallel exploration and knowledge retention in long-running scientific experiments
    The system itself is the primary contribution introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5830 in / 1340 out tokens · 55291 ms · 2026-06-29T12:36:00.267925+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discovering Crystal Structure Prediction Algorithms with an AI Co-Scientist

    cs.LG · 2026-06 · unverdicted · novelty 5.0

    HACO adapts MaskGIT from vision into MaskGXT with symmetry tokens and stratified sampling, reaching 79.06% METRe accuracy on MP-20 polymorph split versus 70.87% for the best baseline.

Reference graph

Works this paper leans on

101 extracted references · 42 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Empow- ering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

    Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empow- ering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

  2. [2]

    Miller, Matthew Greenig, Benjamin Tenmann, and Bo Wang

    Henry E. Miller, Matthew Greenig, Benjamin Tenmann, and Bo Wang. BioML-bench: Eval- uation of AI agents for end-to-end biomedical ML.bioRxiv, 2025. doi: 10.1101/2025.09.01. 673319. URL https://www.biorxiv.org/content/early/2025/09/28/2025.09.01.673319. 10

  3. [3]

    Autoresearch: AI agents running research on single-GPU nanochat training automatically

    Andrej Karpathy. Autoresearch: AI agents running research on single-GPU nanochat training automatically. https://github.com/karpathy/autoresearch, 2026. GitHub repository

  4. [4]

    Kosmos: An AI Scientist for Autonomous Discovery

    Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, et al. Kosmos: An AI scientist for autonomous discovery.arXiv preprint arXiv:2511.02824, 2025

  5. [5]

    Stella: Towards a biomedical world model with self-evolving multimodal agents.bioRxiv, 2026

    Ruofan Jin, Mingyang Xu, Fei Meng, Guancheng Wan, Qingran Cai, Yize Jiang, Jin Han, Yuanyuan Chen, Wanqing Lu, Mengyang Wang, Zhiqian Lan, Yuxuan Jiang, Junhong Liu, Dongyao Wang, Le Cong, and Zaixi Zhang. Stella: Towards a biomedical world model with self-evolving multimodal agents.bioRxiv, 2026. doi: 10.1101/2025.07.01.662467. URL https://www.biorxiv.or...

  6. [6]

    Txagent: an ai agent for therapeutic reason- ing across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

    Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: an ai agent for therapeutic reason- ing across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

  7. [7]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025

  8. [8]

    Ai mirrors experimental science to uncover a mechanism of gene transfer crucial to bacterial evolution

    José R Penadés, Juraj Gottweis, Lingchen He, Jonasz B Patkowski, Alexander Daryin, Wei- Hung Weng, Tao Tu, Anil Palepu, Artiom Myaskovsky, Annalisa Pawlosky, et al. Ai mirrors experimental science to uncover a mechanism of gene transfer crucial to bacterial evolution. Cell, 188(23):6654–6665, 2025

  9. [9]

    Li, Shanghua Gao, Wanxiang Shen, Valentina Giunchiglia, Andrew Shen, Yepeng Huang, Zhenglun Kong, and Marinka Zitnik

    Pengwei Sui, Michelle M. Li, Shanghua Gao, Wanxiang Shen, Valentina Giunchiglia, Andrew Shen, Yepeng Huang, Zhenglun Kong, and Marinka Zitnik. Medea: An omics ai agent for therapeutic discovery.bioRxiv, 2026. doi: 10.64898/2026.01.16.696667. URL https: //www.biorxiv.org/content/early/2026/01/20/2026.01.16.696667

  10. [10]

    Biomni: A general-purpose biomedical AI agent

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical AI agent. biorxiv, 2025

  11. [11]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Ja- cenko, and Yuxiang Wu. AIDE: AI-driven exploration in the space of code.arXiv preprint arXiv:2502.13138, 2025

  12. [12]

    CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

    Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

  13. [13]

    Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

    Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, et al. Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

  14. [14]

    The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

    Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

  15. [15]

    Improv- ing factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning, 2024

  16. [16]

    ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs, 2024

    Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs, 2024. URL https: //arxiv.org/abs/2309.13007

  17. [17]

    Proteingym: Large-scale benchmarks for protein fitness prediction and design

    Pascal Notin, Aaron Kollasch, Daniel Ritter, Lood van Niekerk, Steffanie Paul, Han Spinner, Nathan Rollins, Ada Shaw, Rose Orenbuch, Ruben Weitzman, Jonathan Frazer, Mafalda Dias, Dinko Franceschi, Yarin Gal, and Debora Marks. Proteingym: Large-scale benchmarks for protein fitness prediction and design. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

  18. [18]

    Agentic AI for scientific discovery: A survey of progress, challenges, and future directions,

    Mourad Gridach, Jay Nanavati, Khaldoun Zine El Abidine, Lenon Mendes, and Christina Mack. Agentic AI for scientific discovery: A survey of progress, challenges, and future directions,

  19. [19]

    URL https://arxiv.org/abs/2503.08979

  20. [20]

    A vision for auto research with LLM agents, 2025

    Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lyuye Zhang, Ming-Ming Cheng, Penghai Zhao, Tianlin Li, Xiaojun Jia, Xiang Li, Xingshuai Li, Yang Liu, Yebo Feng, Yihao Huang, Yijia Xu, Yuqiang Sun, Zhenhong Zhou, and Zhengzi Xu. A vision for auto research with LLM agents, 2025. URL https://arxiv.org/abs/2504.18765

  21. [21]

    Agent laboratory: Using LLM agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

  22. [22]

    Jonathan Bragg, Mike D’Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, D...

  23. [23]

    LMR-BENCH: Evaluating LLM agent’s ability on reproducing language modeling research,

    Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, and Xinya Du. LMR-BENCH: Evaluating LLM agent’s ability on reproducing language modeling research,

  24. [24]

    URL https://arxiv.org/abs/2506.17335

  25. [25]

    Dechao Bu, Jingbo Sun, Kun Li, Zihao He, Wei Huang, Jinlin Hu, Shanshan Zhang, Shuang- shuang Lei, Peipei Huo, Zhihao Wang, et al. Empowering ai data scientists using a multi-agent llm framework with self-evolving capabilities for autonomous, tool-aware biomedical data analyses.Nature Biomedical Engineering, pages 1–16, 2026

  26. [26]

    Robin: A multi-agent system for automating scientific discovery.arXiv preprint arXiv:2505.13400, 2025

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J Szostkiewicz, Jon M Laurent, Muhammed T Razzak, Andrew D White, Michaela M Hinks, and Samuel G Rodriques. Robin: A multi-agent system for automating scientific discovery.arXiv preprint arXiv:2505.13400, 2025

  27. [27]

    GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

    Haoyang Liu, Yijiang Li, and Haohan Wang. GenoMAS: A multi-agent framework for scientific discovery via code-driven gene expression analysis.arXiv preprint arXiv:2507.21035, 2025

  28. [28]

    Piflow: Principle-aware scientific discovery with multi-agent collaboration.arXiv preprint arXiv:2505.15047, 2025

    Yingming Pu, Tao Lin, and Hongyu Chen. Piflow: Principle-aware scientific discovery with multi-agent collaboration.arXiv preprint arXiv:2505.15047, 2025

  29. [29]

    Scitoolagent: a knowledge-graph-driven scientific agent for multitool integration.Nature Computational Science, 5(10):962–972, 2025

    Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, and Huajun Chen. Scitoolagent: a knowledge-graph-driven scientific agent for multitool integration.Nature Computational Science, 5(10):962–972, 2025

  30. [30]

    ChemBOMAS: Accelerated BO in chemistry with LLM-enhanced multi-agent system.arXiv preprint arXiv:2509.08736, 2025

    Dong Han, Zhehong Ai, Pengxiang Cai, Shanya Lu, Jianpeng Chen, Zihao Ye, Shuzhou Sun, Ben Gao, Lingli Ge, Weida Wang, et al. ChemBOMAS: Accelerated BO in chemistry with LLM-enhanced multi-agent system.arXiv preprint arXiv:2509.08736, 2025

  31. [31]

    SR-scientist: Scientific equation discovery with agentic AI.arXiv preprint arXiv:2510.11661, 2025

    Shijie Xia, Yuhan Sun, and Pengfei Liu. SR-scientist: Scientific equation discovery with agentic AI.arXiv preprint arXiv:2510.11661, 2025

  32. [32]

    SelfAI: A self-directed framework for long-horizon scientific discovery, 2025

    Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Xiaobing Yu, Yu Zhong, Shangqi Deng, Ufaq Khan, Jianghao Wu, Xiaofeng Liu, Imran Razzak, Xiaojun Chang, and Yutong Xie. SelfAI: A self-directed framework for long-horizon scientific discovery, 2025. URL https://arxiv.org/abs/ 2512.00403. 12

  33. [33]

    EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026

    Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026. URL https://arxiv.org/abs/2603.08127

  34. [34]

    CASCADE: Cumulative agentic skill creation through autonomous development and evolution

    Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. CASCADE: Cumulative agentic skill creation through autonomous development and evolution. arXiv preprint arXiv:2512.23880, 2025

  35. [35]

    Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

  36. [36]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  37. [37]

    Language agents as optimizable graphs.arXiv preprint arXiv:2402.16823, 2024

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs.arXiv preprint arXiv:2402.16823, 2024

  38. [38]

    On the resilience of LLM-based multi-agent collaboration with faulty agents

    Jen tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael Lyu, and Maarten Sap. On the resilience of LLM-based multi-agent collaboration with faulty agents. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=bkiM54QftZ

  39. [39]

    Can ai agents agree?arXiv preprint arXiv:2603.01213, 2026

    Frédéric Berdoz, Leonardo Rugli, and Roger Wattenhofer. Can ai agents agree?arXiv preprint arXiv:2603.01213, 2026

  40. [40]

    Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. Multi-agent collaboration mechanisms: A survey of LLMs, 2025. URL https://arxiv.org/abs/2501.06322

  41. [41]

    Understanding agent scaling in LLM-based multi-agent systems via diversity.arXiv preprint arXiv:2602.03794, 2026

    Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, and Shangding Gu. Understanding agent scaling in LLM-based multi-agent systems via diversity.arXiv preprint arXiv:2602.03794, 2026

  42. [42]

    Towards a Science of Scaling Agent Systems

    Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, et al. Towards a science of scaling agent systems.arXiv preprint arXiv:2512.08296, 2025

  43. [43]

    Multi-Agent Teams Hold Experts Back

    Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, and James Zou. Multi-agent teams hold experts back.arXiv preprint arXiv:2602.01011, 2026

  44. [44]

    MultiAgentBench: Evaluating the collaboration and competition of LLM agents, 2025

    Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You. MultiAgentBench: Evaluating the collaboration and competition of LLM agents, 2025. URL https://arxiv.org/abs/2503.01935

  45. [45]

    Collaborative research across disciplinary and organi- zational boundaries.Social studies of science, 35(5):703–722, 2005

    Jonathon N Cummings and Sara Kiesler. Collaborative research across disciplinary and organi- zational boundaries.Social studies of science, 35(5):703–722, 2005

  46. [46]

    The increasing dominance of teams in production of knowledge.Science, 316(5827):1036–1039, 2007

    Stefan Wuchty, Benjamin F Jones, and Brian Uzzi. The increasing dominance of teams in production of knowledge.Science, 316(5827):1036–1039, 2007

  47. [47]

    Flat teams drive scientific innovation.Proceedings of the National Academy of Sciences, 119(23):e2200927119, 2022

    Fengli Xu, Lingfei Wu, and James Evans. Flat teams drive scientific innovation.Proceedings of the National Academy of Sciences, 119(23):e2200927119, 2022

  48. [48]

    The science of team science: A review of the empirical evidence and research gaps on collaboration in science.American psychologist, 73(4):532, 2018

    Kara L Hall, Amanda L V ogel, Grace C Huang, Katrina J Serrano, Elise L Rice, Sophia P Tsakraklides, and Stephen M Fiore. The science of team science: A review of the empirical evidence and research gaps on collaboration in science.American psychologist, 73(4):532, 2018. 13

  49. [49]

    Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering.arXiv preprint arXiv:2601.10402, 2026

    Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Han- rui Wang, Wei-Chen Wang, Yuzhi Zhang, et al. Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering.arXiv preprint arXiv:2601.10402, 2026

  50. [50]

    Emergent Coordination in Multi-Agent Language Models

    Christoph Riedl. Emergent coordination in multi-agent language models.arXiv preprint arXiv:2510.05174, 2025

  51. [51]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchin- son, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. InProceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019

  52. [52]

    Claude Code: Overview

    Anthropic. Claude Code: Overview. https://code.claude.com/docs/en/overview, 2026. Product documentation. Accessed: 2026-05-06

  53. [53]

    Claude Sonnet 4.6

    Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Model documentation. Model ID:claude-sonnet-4-6. Accessed: 2026-05-06

  54. [54]

    Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302, 2023

    Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302, 2023

  55. [55]

    Kermut: Composite kernel regression for protein variant effects

    Peter Mø rch Groth, Mads Herbert Kerrn, Lars Olsen, Jesper Salomon, and Wouter Boomsma. Kermut: Composite kernel regression for protein variant effects. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 29514–29565. Curran Associates, Inc., 2024...

  56. [56]

    S. L. Lee, P. Yadav, Y. Li, J. J. Meudt, J. Strang, D. Hebel, A. Alfson, S. J. Olson, T. R. Kruser, J. B. Smilowitz, K. Borchert, B. Loritz, L. Gharzai, S. Karimpour, J. Bayouth, and M. F. Bassetti. Uw-madison gi tract image segmentation. https://kaggle.com/competitions/uw-madison-gi-tract-image-segmentation, 2022. Kaggle

  57. [57]

    Osic pulmonary fibrosis progression

    Ahmed Shahin, Carmela Wegworth, David, Elizabeth Estes, Julia Elliott, Justin Zita, SimonWalsh, Slepetys, and Will Cukierski. Osic pulmonary fibrosis progression. https://kaggle.com/competitions/osic-pulmonary-fibrosis-progression, 2020. Kaggle

  58. [58]

    Histopathologic cancer detection

    Will Cukierski. Histopathologic cancer detection. https://kaggle.com/competitions/histopathologic-cancer-detection, 2018. Kaggle

  59. [59]

    Rsna-miccai brain tumor radiogenomic classification

    Adam Flanders, Chris Carr, Evan Calabrese, PhD FelipeKitamura, MD, inversion, JeffRudie, John Mongan, Julia Elliott, Luciano Prevedello, Michelle Riopel, sprint, Spyridon Bakas, and Ujjwal. Rsna-miccai brain tumor radiogenomic classification. https://kaggle.com/competitions/rsna-miccai-brain-tumor-radiogenomic-classification, 2021. Kaggle

  60. [60]

    Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548, 2021

    Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548, 2021

  61. [61]

    Polaris: The benchmarking platform for drug discovery

    Polaris. Polaris: The benchmarking platform for drug discovery. https://polarishub.io/, 2026. Accessed: May 2026

  62. [62]

    Defining and benchmarking open problems in single-cell analysis. Nature Biotechnology, 43(7):1035–1040, 2025

    Malte D Luecken, Scott Gigante, Daniel B Burkhardt, Robrecht Cannoodt, Daniel C Strobl, Nikolay S Markov, Luke Zappia, Giovanni Palla, Wesley Lewis, Daniel Dimitrov, et al. Defining and benchmarking open problems in single-cell analysis. Nature Biotechnology, 43(7):1035–1040, 2025

  63. [63]

    Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023

  64. [64]

    Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022

    Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022

  65. [65]

    Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020

    Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020

  66. [66]

    Chemprop: a machine learning package for chemical property prediction. Journal of chemical information and modeling, 64(1):9–17, 2024

    Esther Heid, Kevin P Greenman, Yunsie Chung, Shih-Cheng Li, David E Graff, Florence H Vermeire, Haoyang Wu, William H Green, and Charles J McGill. Chemprop: a machine learning package for chemical property prediction. Journal of chemical information and modeling, 64(1):9–17, 2024

  67. [67]

    RDKit: Open-source cheminformatics

    RDKit: Open-source cheminformatics. https://www.rdkit.org, 2026. Accessed: May 2026

  68. [68]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785

  69. [69]

    Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017

  70. [70]

    CatBoost: gradient boosting with categorical features support

    Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. Catboost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363, 2018

  71. [71]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

  72. [72]

    Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  73. [73]

    Masked inverse folding with sequence transfer for protein representation learning. Protein Engineering, Design and Selection, 36:gzad015, 2023

    Kevin K Yang, Niccolò Zanichelli, and Hugh Yeh. Masked inverse folding with sequence transfer for protein representation learning. Protein Engineering, Design and Selection, 36:gzad015, 2023

  74. [74]

    From high-throughput evaluation to wet-lab studies: advancing mutation effect prediction with a retrieval-enhanced model. Bioinformatics, 41(Supplement 1):i401–i409, 07 2025

    Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, and Bingxin Zhou. From high-throughput evaluation to wet-lab studies: advancing mutation effect prediction with a retrieval-enhanced model. Bioinformatics, 41(Supplement 1):i401–i409, 07 2025. doi: 10.1093/bioinformatics/btaf189. URL https://doi.org/10.1093/bioinformatics/btaf189

  75. [75]

    Prosst: Protein language modeling with quantized structure and disentangled attention. Advances in Neural Information Processing Systems, 37:35700–35726, 2024

    Mingchen Li, Yang Tan, Xinzhu Ma, Bozitao Zhong, Huiqun Yu, Ziyi Zhou, Wanli Ouyang, Bingxin Zhou, Pan Tan, and Liang Hong. Prosst: Protein language modeling with quantized structure and disentangled attention. Advances in Neural Information Processing Systems, 37:35700–35726, 2024

  76. [76]

    Residue conservation and solvent accessibility are (almost) all you need for predicting mutational effects in proteins. Bioinformatics, 41(6):btaf322, 2025

    Matsvei Tsishyn, Pauline Hermans, Marianne Rooman, and Fabrizio Pucci. Residue conservation and solvent accessibility are (almost) all you need for predicting mutational effects in proteins. Bioinformatics, 41(6):btaf322, 2025

  77. [77]

    Prescott: a population aware, epistatic, and structural model accurately predicts missense effects. Genome Biology, 26(1):113, 2025

    Mustafa Tekpinar, Laurent David, Thomas Henry, and Alessandra Carbone. Prescott: a population aware, epistatic, and structural model accurately predicts missense effects. Genome Biology, 26(1):113, 2025

  78. [78]

    xtrimopglm: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins. Nature Methods, 22(5):1028–1039, 2025

    Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, et al. xtrimopglm: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins. Nature Methods, 22(5):1028–1039, 2025

  79. [79]

    Saprot: Protein language modeling with structure-aware vocabulary

    Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. Saprot: Protein language modeling with structure-aware vocabulary. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=6MRm3G4NiU

  80. [80]

    Learning inverse folding from millions of predicted structures

    Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. In International conference on machine learning, pages 8946–8970. PMLR, 2022
