arxiv: 2606.24530 · v1 · pith:DWZHFPVTnew · submitted 2026-06-23 · 💻 cs.CL

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Pith reviewed 2026-06-26 00:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords coding agents · scientific discovery · benchmark · Nature papers · methodological translation · agent evaluation · containerized environments

The pith

The strongest AI coding agent surpasses published SOTA on only 17.8 percent of tasks drawn from Nature-family papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NatureBench to test whether coding agents can perform genuine scientific discovery on 90 tasks distilled from peer-reviewed Nature-family publications. It evaluates ten frontier agent configurations under a web-search-disabled protocol and reports that even the strongest model exceeds the original papers' SOTA on fewer than one in five tasks. Successes occur mainly when agents reframe the original problem as a standard supervised learning task rather than devising new scientific methods. Failures stem mostly from wrong method choice and an insufficient compute budget, not from misunderstanding the task. The benchmark relies on a containerized pipeline, NatureGym, to standardize evaluation across disciplines.

Core claim

The strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding.
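
As a rough illustration of how the headline figure could be tallied, the Python sketch below applies the normalized-gap form quoted in the simulated rebuttal further down, g = (agent_perf - SOTA)/SOTA. The function names, the higher_is_better flag, and the toy scores are assumptions made for this example, not the paper's code or data. If all 90 tasks sit in the denominator, 17.8% corresponds to 16 tasks.

```python
from typing import Iterable, Tuple


def normalized_gap(agent_perf: float, sota: float, higher_is_better: bool = True) -> float:
    """Relative gap between an agent's score and the published SOTA.

    Assumption: for lower-is-better metrics (e.g. an error rate) the sign is
    flipped so that a positive g always means the agent did better.
    """
    if sota == 0:
        raise ValueError("normalized gap is undefined when SOTA is zero")
    gap = (agent_perf - sota) / abs(sota)
    return gap if higher_is_better else -gap


def surpass_rate(results: Iterable[Tuple[float, float, bool]], threshold: float = 0.1) -> float:
    """Fraction of tasks whose normalized gap strictly exceeds the threshold."""
    rows = list(results)
    wins = sum(1 for perf, sota, hib in rows if normalized_gap(perf, sota, hib) > threshold)
    return wins / len(rows)


# Toy data (hypothetical): 16 tasks clearly above SOTA, 74 tasks below it.
toy_results = [(1.2, 1.0, True)] * 16 + [(0.9, 1.0, True)] * 74
rate = surpass_rate(toy_results)
print(f"{rate:.1%}")  # 17.8%
```

The strict inequality mirrors the "g>0.1" wording; whether the paper handles lower-is-better metrics by flipping the sign, as done here, is not stated in the material above.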

What carries the argument

NatureBench, a set of 90 tasks distilled from Nature-family papers, together with the NatureGym pipeline that builds standardized containerized environments for each task.

If this is right

  • Agents that succeed do so by converting the original scientific question into a supervised prediction problem they already know how to solve.
  • Wrong method choice and insufficient compute budget account for most failures rather than inability to parse the task statement.
  • Standardized containerization removes environment-fragmentation barriers that previously made agent-on-research benchmarks hard to trust.
  • The benchmark supplies a public leaderboard and maintainer-side reproduction protocol for tracking future progress.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the translation pattern persists, scaling compute alone will not close the gap to scientific invention.
  • The same evaluation pipeline could be applied to papers outside the Nature family to test whether the 17.8 percent ceiling is domain-specific.
  • A follow-up study could measure whether agents improve when given explicit incentives or tools for proposing novel experimental designs rather than defaulting to familiar models.

Load-bearing premise

The 90 tasks distilled from Nature-family papers, together with the NatureGym containerization process, faithfully represent the original scientific discovery problems and experimental setups without introducing systematic biases in difficulty or required expertise.

What would settle it

A controlled experiment in which an agent configuration, under the identical web-search-disabled protocol, surpasses the published SOTA on more than half the tasks while using methods that are not simple translations of existing supervised-learning pipelines.

Original abstract

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces NatureBench, a cross-disciplinary benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, constructed via the NatureGym automated pipeline that generates standardized containerized environments per task. Under a web-search-disabled protocol, ten frontier agent configurations are evaluated; the strongest model surpasses published SOTA on only 17.8% of tasks using the g>0.1 criterion. Pathway analysis indicates successes arise mainly from methodological translation (recasting tasks as supervised prediction problems) rather than genuine scientific invention, while failures stem primarily from incorrect method choice and insufficient compute rather than task misunderstanding. The benchmark, pipeline, and public leaderboard with maintainer-side reproduction are released.

Significance. If the 90 distilled tasks and NatureGym environments faithfully preserve the original scientific framing, required expertise, and method space of the source papers, the 17.8% result would provide concrete evidence that current coding agents remain limited in moving from reproduction to discovery on real scientific problems. The explicit release of the full benchmark, containerization pipeline, and reproducible leaderboard constitutes a clear strength, enabling direct verification and extension by the community.

major comments (3)
  1. [Abstract] The central performance claim (strongest model surpasses SOTA on 17.8% of tasks under g>0.1) and associated failure-mode analysis are stated without any description of how SOTA values were extracted from the source papers, how the g>0.1 threshold was computed, whether error bars or statistical tests were applied, or the precise criteria used to select and distill the 90 tasks. These omissions are load-bearing for the headline empirical result.
  2. [§3 and §4] §3 (NatureBench construction) and §4 (Evaluation): The interpretation that agents succeed via methodological translation rather than invention, and that the low success rate reflects agent limitations rather than benchmark artifacts, rests on the untested assumption that the distillation and containerization steps preserve original problem difficulty and expertise requirements. No validation, sensitivity analysis, or comparison to the source papers' original experimental setups is reported.
  3. [§4] §4 (Results): The pathway analysis and attribution of failures to 'wrong method choice' and 'insufficient compute budget' presuppose that the NatureGym environments allow agents access to the full original method space; if containerization systematically narrows this space or reformulates tasks into ML-friendly forms, the reported failure modes become circular with the benchmark design.
minor comments (2)
  1. [Abstract] The GitHub link is provided but the manuscript does not describe the exact protocol for maintainer-side reproduction or how the public leaderboard will be updated when new agents are evaluated.
  2. [Abstract] Notation for g>0.1 is introduced without an explicit equation or definition in the main text; a short formal definition would improve clarity.
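
Major comment 1 asks whether error bars were applied. One hedged way to see how much uncertainty a single-run success rate carries is a Wilson score interval on the binomial proportion. The sketch below assumes 16 successes out of 90 tasks (the count implied by 17.8%), treats tasks as independent trials, and uses z = 1.96 for an approximate 95% interval. None of this comes from the paper, and the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 16 of 90 tasks surpass SOTA (about 17.8%).
low, high = wilson_interval(16, 90)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 11% to 27%, which suggests that a single evaluation pass over 90 tasks leaves the underlying success rate only loosely pinned down.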

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our empirical claims and validation of benchmark construction. We address each major comment below. Where details were omitted, we will revise the manuscript to include them.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim (strongest model surpasses SOTA on 17.8% of tasks under g>0.1) and associated failure-mode analysis are stated without any description of how SOTA values were extracted from the source papers, how the g>0.1 threshold was computed, whether error bars or statistical tests were applied, or the precise criteria used to select and distill the 90 tasks. These omissions are load-bearing for the headline empirical result.

    Authors: We agree these details are essential. In the revised version we will expand §3 with a dedicated subsection on task selection criteria (relevance to discovery vs. reproduction, feasibility of containerization, diversity across disciplines), the SOTA extraction protocol (manual review of results tables and supplementary material in each source paper, selecting the strongest reported metric under comparable conditions), and the g>0.1 definition (normalized gap g = (agent_perf - SOTA)/SOTA, with a task counted as surpassing SOTA when g > 0.1). We will report error bars from multiple agent seeds where compute permits and note the absence of formal statistical tests due to the heterogeneous task metrics. A concise reference to these procedures will be added to the abstract. revision: yes

  2. Referee: [§3 and §4] §3 (NatureBench construction) and §4 (Evaluation): The interpretation that agents succeed via methodological translation rather than invention, and that the low success rate reflects agent limitations rather than benchmark artifacts, rests on the untested assumption that the distillation and containerization steps preserve original problem difficulty and expertise requirements. No validation, sensitivity analysis, or comparison to the source papers' original experimental setups is reported.

    Authors: We acknowledge the absence of explicit validation. The NatureGym pipeline was designed to retain the original scientific objective and required expertise by extracting the core experimental protocol verbatim into the container; however, we did not previously report a sensitivity check. In revision we will add (i) a table comparing key hyperparameters and evaluation metrics between original papers and NatureGym environments for a random subset of 15 tasks, and (ii) a short sensitivity analysis showing that relaxing container constraints does not materially change the 17.8% figure. This will directly test the preservation assumption. revision: yes

  3. Referee: [§4] §4 (Results): The pathway analysis and attribution of failures to 'wrong method choice' and 'insufficient compute budget' presuppose that the NatureGym environments allow agents access to the full original method space; if containerization systematically narrows this space or reformulates tasks into ML-friendly forms, the reported failure modes become circular with the benchmark design.

    Authors: We maintain that the environments were constructed to expose the full method space described in each source paper, including non-ML baselines where present. Nevertheless, the concern about possible narrowing is valid. In the revision we will insert a new paragraph in §4 that (a) enumerates the method categories explicitly permitted in each environment, (b) provides qualitative evidence that agents could (and sometimes did) invoke non-ML approaches, and (c) discusses the residual risk of implicit ML bias introduced by containerization. This will clarify that the failure-mode attributions are not purely circular. revision: yes

Circularity Check

0 steps flagged

No circularity: results derive from external published SOTA comparisons on independently constructed benchmark tasks.

Full rationale

The paper distills 90 tasks from external Nature-family publications and measures agent performance against those papers' independently reported SOTA numbers. The 17.8% success rate under g>0.1 and the method-pathway observations are direct empirical counts from agent executions; no equations, fitted parameters, or self-citations reduce these counts to quantities defined by the authors themselves. The benchmark-construction assumption is a standard external-validity concern rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central performance claim rests on the assumption that Nature-family tasks can be faithfully containerized and that published SOTA numbers are directly comparable to agent outputs; no free parameters are described in the abstract.

axioms (1)
  • Domain assumption: Tasks extracted from Nature papers can be standardized into containerized environments that preserve the original scientific validity and difficulty.
    Invoked to justify the NatureGym pipeline and the claim that agents are tested on real scientific problems.
invented entities (2)
  • NatureBench (no independent evidence)
    purpose: Cross-discipline benchmark of 90 tasks for evaluating coding agents on scientific discovery
    Newly introduced construct; no independent evidence outside this paper
  • NatureGym (no independent evidence)
    purpose: Automated pipeline that constructs standardized per-task containerized environments from source papers
    Newly introduced construct; no independent evidence outside this paper

pith-pipeline@v0.9.1-grok · 5766 in / 1391 out tokens · 21942 ms · 2026-06-26T00:11:13.138233+00:00 · methodology


Reference graph

Works this paper leans on

99 extracted references · 7 canonical work pages

  1. [1]

    Accurate structure prediction of biomolecular interactions with alphafold 3

    Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630 0 (8016): 0 493--500, 2024

  2. [2]

    Claude code: An agentic coding tool

    Anthropic . Claude code: An agentic coding tool. https://github.com/anthropics/claude-code, 2025

  3. [3]

    System card: Claude opus 4.6

    Anthropic . System card: Claude opus 4.6. https://www.anthropic.com/claude-opus-4-6-system-card, 2026 a

  4. [4]

    System card: Claude opus 4.7

    Anthropic . System card: Claude opus 4.7. https://www.anthropic.com/claude-opus-4-7-system-card, 2026 b

  5. [5]

    Claude api pricing

    Anthropic . Claude api pricing. https://platform.claude.com/docs/en/about-claude/pricing, 2026 c

  6. [6]

    OpenScholar : Synthesizing scientific literature with retrieval-augmented LM s, 2024

    Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. OpenScholar :...

  7. [7]

    Accurate prediction of protein structures and interactions using a three-track neural network

    Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373 0 (6557): 0 871--876, 2021

  8. [8]

    Mask-prior-guided denoising diffusion improves inverse protein folding

    Peizhen Bai, Filip Miljkovi \'c , Xianyuan Liu, Leonardo De Maria, Rebecca Croasdale-Wood, Owen Rackham, and Haiping Lu. Mask-prior-guided denoising diffusion improves inverse protein folding. Nature Machine Intelligence, 7 0 (6): 0 876--888, 2025

  9. [9]

    Atomically accurate de novo design of antibodies with rfdiffusion

    Nathaniel R Bennett, Joseph L Watson, Robert J Ragotte, Andrew J Borst, D \'e Jena \'e L See, Connor Weidle, Riti Biswas, Yutong Yu, Ellen L Shrock, Russell Ault, et al. Atomically accurate de novo design of antibodies with rfdiffusion. Nature, 649 0 (8095): 0 183--193, 2026

  10. [10]

    A foundation model for the earth system

    Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641 0 (8065): 0 1180--1187, 2025

  11. [11]

    Autonomous chemical research with large language models

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023

  12. [12]

    Genome modelling and design across all domains of life with evo 2

    Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with evo 2. Nature, 652 0 (8112): 0 1349--1361, 2026

  13. [13]

    AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026. URL https://arxiv.org/abs/2602.20133

  14. [14]

    Mle-bench: Evaluating machine learning agents on machine learning engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, volume 2025, pages 50466--50494, 2025

  15. [15]

    A generalized-template-based graph neural network for accurate organic reactivity prediction

    Shuan Chen and Yousung Jung. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nature Machine Intelligence, 4 0 (9): 0 772--780, 2022. doi:10.1038/s42256-022-00526-z

  16. [16]

    Accurate proteome-wide missense variant effect prediction with alphamissense

    Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvil \.e Z emgulyt \.e , Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, et al. Accurate proteome-wide missense variant effect prediction with alphamissense. Science, 381 0 (6664): 0 eadg7492, 2023

  17. [17]

    Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative optimization, 2026

    Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, and Qinhuai Na. Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative ...

  18. [18]

    scgpt: toward building a foundation model for single-cell multi-omics using generative ai

    Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21 0 (8): 0 1470--1480, 2024

  19. [19]

    de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot

    Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P. de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot. Nucleotide transformer: building and evaluating robust foundation models for ...

  20. [20]

    Deepseek v4 preview release

    DeepSeek . Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, 2026

  21. [21]

    Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih

    Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih. Collider -bench: Benchmarking AI agents with particle physics analysis reproduction, 2026. URL https://arxiv.org/abs/2605.13950

  22. [22]

    Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al

    Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610 0 (7930): 0 47--53, 2022 a

  23. [23]

    Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610: 0 47--53, 2022 b . doi:...

  24. [24]

    MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025

    Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, and Yuzhuo Fu. MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025. URL https://arxiv.org/abs/2508.14146

  25. [25]

    A multi-agent system for automating scientific discovery

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, et al. A multi-agent system for automating scientific discovery. Nature, 2026. doi:10.1038/s41586-026-10652-y. URL https://www.nature.com/articles/s41586-026-10652-y

  26. [26]

    Gemini cli: An open-source ai agent

    Google . Gemini cli: An open-source ai agent. https://github.com/google-gemini/gemini-cli, 2025

  27. [27]

    Gemini 3.5 flash model card

    Google DeepMind . Gemini 3.5 flash model card. https://deepmind.google/models/model-cards/gemini-3-5-flash/, 2026

  28. [28]

    Accelerating scientific discovery with co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Petar Sirkovic, Artiom Myaskovsky, Grzegorz Glowaty, Felix Weissenberger, Alessio Orlandi, Dan Popovici, et al. Accelerating scientific discovery with co-scientist. Nature, pages 1--3, 2026 a

  29. [29]

    Accelerating scientific discovery with Co-Scientist

    Juraj Gottweis et al. Accelerating scientific discovery with Co-Scientist . Nature, 2026 b . doi:10.1038/s41586-026-10644-y. URL https://www.nature.com/articles/s41586-026-10644-y

  30. [30]

    Artificial intelligence tools expand scientists’ impact but contract science’s focus

    Qianyue Hao, Fengli Xu, Yong Li, and James Evans. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature, pages 1--7, 2026

  31. [31]

    Closed-form continuous-time neural networks

    Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, and Daniela Rus. Closed-form continuous-time neural networks. Nature Machine Intelligence, 4 0 (11): 0 992--1003, 2022

  32. [32]

    REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025

    Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025. URL https://arxiv.org/abs/2507.18901

  33. [33]

    MLAgentBench : Evaluating language agents on machine learning experimentation, 2023

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench : Evaluating language agents on machine learning experimentation, 2023. URL https://arxiv.org/abs/2310.03302

  34. [34]

    Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, and Daniel Khashabi. Can coding agents reproduce findings in computational materials science?, 2026. URL h...

  35. [35]

    Olympiad-level formal mathematical reasoning with reinforcement learning

    Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pages 1--3, 2025

  36. [36]

    Equivariant 3d-conditional diffusion model for molecular linker design

    Ilia Igashov, Hannes St \"a rk, Cl \'e ment Vignac, Arne Schneuing, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, and Bruno Correia. Equivariant 3d-conditional diffusion model for molecular linker design. Nature Machine Intelligence, 6 0 (4): 0 417--427, 2024

  37. [37]

    ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering

    Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=JCjGvbsOmQ

  38. [38]

    DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024

    Peter Jansen, Marc-Alexandre Cote, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024. URL https://arxiv.org/abs/2406.06769

  39. [39]

    DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026

    Jiachen Jiang, Tianyu Ding, and Zhihui Zhu. DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026. URL https://arxiv.org/abs/2602.02919

  40. [40]

    Highly accurate protein structure prediction with alphafold

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Z \' dek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. nature, 596 0 (7873): 0 583--589, 2021

  41. [41]

    autoresearch

    Andrej Karpathy. autoresearch . https://github.com/karpathy/autoresearch, 2026

  42. [42]

    From reproduction to replication: Evaluating research agents with progressive code masking, 2025

    Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, and Daniel Fried. From reproduction to replication: Evaluating research agents with progressive code masking, 2025. URL https://arxiv.org/abs/2506.19724

  43. [43]

    Cell2location maps fine-grained cell types in spatial transcriptomics

    Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W King, Tong Li, Rasa Elmentaite, Artem Lomakin, Veronika Kedlian, Adam Gayoso, et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology, 40 0 (5): 0 661--671, 2022

  44. [44]

    Rodriques, and Andrew D

    Jakub Lala, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. PaperQA : Retrieval-augmented generative agent for scientific research, 2023. URL https://arxiv.org/abs/2312.07559

  45. [45]

    Learning skillful medium-range global weather forecasting

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382 0 (6677): 0 1416--1421, 2023

  46. [46]

    Laurent, Joseph D

    Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB -bench: Measuring capabilities of language models for biology research, 2024. URL https://arxiv.org/abs/2407.10362

  47. [47]

    AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026

    Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, and Tie-Yan Liu. AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026. URL https://arxiv.org/abs/2604.05550

  48. [48]

    Can large language models provide useful feedback on research papers? a large-scale empirical analysis, 2023

    Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, and James Zou. Can large language models provide useful feedback on research papers? a large-scale empirical analysis, 2023. URL https://arxiv.org/abs/2310.01783

  49. [49]

    Position: Agentic evolution is the path to evolving LLM s, 2026

    Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, and Jian Pei. Position: Agentic evolution is the path to evolving LLM s, 2026. URL https://arxiv.org/abs/2602.00359

  50. [50]

    Evolutionary-scale prediction of atomic-level protein structure with a language model

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 0 (6637): 0 1123--1130, 2023

  51. [51]

    Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025

    Gang Liu, Yihan Zhu, et al. Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025. URL https://arxiv.org/abs/2510.06056

  52. [52]

    Ryan Liu and Nihar B. Shah. ReviewerGPT ? an exploratory study on using large language models for paper reviewing, 2023. URL https://arxiv.org/abs/2306.00622

  53. [53]

    Evox: Meta-evolution for automated discovery

    Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026

  54. [54]

    The AI scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

  55. [55]

    Towards end-to-end automation of ai research

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research. Nature, 651 0 (8107): 0 914--919, 2026

  56. [56]

    AIRS -bench: a suite of tasks for frontier AI research science agents, 2026

    Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Cha...

  57. [57]

    Bohan Lyu, Yucheng Yang, Siqiao Huang, Jiaru Zhang, Qixin Xu, Xinghan Li, Xinyang Han, Yicheng Zhang, Huaqing Zhang, Runhan Huang, Kaicheng Yang, Zitao Chen, Wentao Guo, Junlin Yang, Xinyue Ai, Wenhao Chai, Yadi Cao, Ziran Yang, Kun Wang, Dapeng Jiang, Huan-ang Gao, Shange Tang, Chengshuai Shi, Simon S. Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi...

  58. [58]

    EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery

    Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026b. URL https://arxiv.org/abs/2603.08127

  59. [59]

    Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, ...

  60. [60]

    Scaling deep learning for materials discovery

    Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990): 80--85, 2023

  61. [61]

    Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning

    Jishuai Miao, Jinzhao Li, Jingxue Xin, Jiajuan Tu, Muyang Ge, Ji Qi, Xiaocheng Zhou, Ying Zhu, Can Yang, and Zhixiang Lin. Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning. Nature Communications, 16(1): 9403, 2025

  62. [62]

    Minimax m2.7: Early echoes of self-evolution

    MiniMax. Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, 2026

  63. [63]

    Kimi k2.6

    Moonshot AI. Kimi k2.6. https://www.kimi.com/ai-models/kimi-k2-6, 2026

  64. [64]

    MLGym: A new framework and benchmark for advancing AI research agents

    Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym: A new framework and benchmark for advancing AI rese...

  65. [65]

    Alexander Novikov, Ngan Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  66. [66]

    Codex cli: Lightweight coding agent that runs in your terminal

    OpenAI. Codex cli: Lightweight coding agent that runs in your terminal. https://github.com/openai/codex, 2025

  67. [67]

    Gpt-5.4 thinking system card

    OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026a

  68. [68]

    Gpt-5.5 system card

    OpenAI. Gpt-5.5 system card. https://openai.com/index/gpt-5-5-system-card/, 2026b

  69. [69]

    What are tokens and how to count them?

    OpenAI. What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them, 2026c

  70. [70]

    Weak signal extraction enabled by deep neural network denoising of diffraction data

    Jens Oppliger, M Michael Denner, Julia Küspert, Ruggero Frison, Qisi Wang, Alexander Morawietz, Oleh Ivashko, Ann-Christin Dippel, Martin von Zimmermann, Izabela Biało, et al. Weak signal extraction enabled by deep neural network denoising of diffraction data. Nature Machine Intelligence, 6(2): 180--186, 2024

  71. [71]

    Human-AI adaptive dynamics drives the emergence of information cocoons

    Jinghua Piao, Jiazhen Liu, Fang Zhang, Jun Su, and Yong Li. Human-AI adaptive dynamics drives the emergence of information cocoons. Nature Machine Intelligence, 5(11): 1214--1224, 2023

  72. [72]

    Enhanced spatial clustering of single-molecule localizations with graph neural networks

    Jesús Pineda, Sergi Masó-Orriols, Montse Masoliver, Joan Bertran, Mattias Goksör, Giovanni Volpe, and Carlo Manzo. Enhanced spatial clustering of single-molecule localizations with graph neural networks. Nature Communications, 16(1): 9693, 2025

  73. [73]

    Probabilistic weather forecasting with machine learning

    Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Probabilistic weather forecasting with machine learning. Nature, 637(8044): 84--90, 2025

  74. [74]

    Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering

    Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar VK, Rongzhi Zhang, Changhao Li, Ian Wong, Sherry Yang, Percy Liang, Chao Zhang, et al. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering. Advances in Neural Information Processing Systems, 38, 2026

  75. [75]

    Qwen3.7: The agent frontier

    Qwen Team. Qwen3.7: The agent frontier. https://qwen.ai/blog?id=qwen3.7, 2026

  76. [76]

    PostTrainBench: Can LLM agents automate LLM post-training?

    Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. PostTrainBench: Can LLM agents automate LLM post-training?, 2026. URL https://arxiv.org/abs/2603.08640

  77. [77]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625: 468--475, 2024. doi:10.1038/s41586-023-06924-6...

  78. [78]

    CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark

    Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark, 2024. URL https://arxiv.org/abs/2409.11363

  79. [79]

    Language agents achieve superhuman synthesis of scientific knowledge

    Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740

  80. [80]

    PaperBench: Evaluating AI's ability to replicate AI research

    Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI's ability to replicate AI research, 2025. URL https://arxiv.org/abs/2504.01848
