NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

arxiv: 2606.24530 · v1 · pith:DWZHFPVTnew · submitted 2026-06-23 · 💻 cs.CL

Yuru Wang, Lejun Cheng, Yuxin Zuo, Sihang Zeng, Bingxiang He, Che Jiang, Junlin Yang, Yuchong Wang, Kaikai Zhao, Weifeng Huang, Kai Tian, Zhenzhao Yuan, Jincheng Zhong, Weizhi Wang, Ning Ding, Bowen Zhou, Kaiyan Zhang

Pith reviewed 2026-06-26 00:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords coding agents · scientific discovery · benchmark · Nature papers · methodological translation · agent evaluation · containerized environments

The pith

Even the strongest AI coding agent surpasses published SOTA on only 17.8 percent of tasks from Nature-family papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NatureBench to test whether coding agents can perform genuine scientific discovery on 90 tasks extracted from peer-reviewed Nature-family publications. It evaluates ten frontier agent setups with web search disabled and reports that even the best model exceeds the original paper's SOTA on fewer than one in five tasks. Success occurs mainly when agents reframe the original problem as a standard supervised learning task rather than devising new scientific methods. Failures stem from incorrect method selection and limited compute budget, not from misunderstanding the task. The benchmark uses a containerized pipeline to standardize evaluation across disciplines.

Core claim

The strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding.

What carries the argument

NatureBench, a set of 90 tasks distilled from Nature-family papers, together with the NatureGym pipeline that builds standardized containerized environments for each task.

If this is right

Agents that succeed do so by converting the original scientific question into a supervised prediction problem they already know how to solve.
Wrong method choice and insufficient compute budget account for most failures rather than inability to parse the task statement.
Standardized containerization removes environment-fragmentation barriers that previously made agent-on-research benchmarks hard to trust.
The benchmark supplies a public leaderboard and maintainer-side reproduction protocol for tracking future progress.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the translation pattern persists, scaling compute alone will not close the gap to scientific invention.
The same evaluation pipeline could be applied to papers outside the Nature family to test whether the 17.8 percent ceiling is domain-specific.
A follow-up study could measure whether agents improve when given explicit incentives or tools for proposing novel experimental designs rather than defaulting to familiar models.

Load-bearing premise

The 90 tasks distilled from Nature-family papers, together with the NatureGym containerization process, faithfully represent the original scientific discovery problems and experimental setups without introducing systematic biases in difficulty or required expertise.

What would settle it

A controlled experiment in which an agent configuration, under the identical web-search-disabled protocol, surpasses the published SOTA on more than half the tasks while using methods that are not simple translations of existing supervised-learning pipelines.

Original abstract

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Agents beat published SOTA on only 17.8% of these Nature tasks, mostly by turning them into standard supervised problems, but the benchmark's task fidelity is the untested core assumption.

The main point is that the strongest agent configuration exceeds the published SOTA on only 17.8% of the 90 tasks under the g>0.1 threshold, and the wins come from recasting the problems as familiar supervised prediction tasks rather than inventing new scientific methods.

The paper builds NatureBench by pulling tasks directly from peer-reviewed Nature-family papers and uses NatureGym to automate containerized environments for each one. This setup directly addresses the environment fragmentation that has hurt earlier agent benchmarks. They evaluate ten frontier agent setups under a no-web-search protocol, provide a breakdown of failure modes (wrong method choice and compute limits dominate, not task misunderstanding), and release the full benchmark, pipeline, and a public leaderboard with reproduction support. Those elements are concrete and address a real gap in how we test agents on scientific work.

The soft spot is the unverified assumption that the distilled tasks and containerization preserve the original discovery framing and required expertise. If the process systematically converts hard problems into forms already common in ML training data or selects cases where published SOTA numbers are brittle, the 17.8% figure and the translation-versus-invention split become artifacts of construction rather than evidence of agent limits. The abstract states the number but gives no details on SOTA extraction, task selection criteria, or how g>0.1 was applied, and there are no error bars or robustness checks mentioned.

This is for groups working on coding agents for science who need benchmarks grounded in real papers instead of synthetic tasks. Readers tracking measurable progress toward discovery-level performance will get usable numbers and failure categories from it. It deserves peer review because the benchmark construction and evaluation protocol are specific enough to check and improve, even if the methods section will need expansion.

Referee Report

3 major / 2 minor

Summary. The paper introduces NatureBench, a cross-disciplinary benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, constructed via the NatureGym automated pipeline that generates standardized containerized environments per task. Under a web-search-disabled protocol, ten frontier agent configurations are evaluated; the strongest model surpasses published SOTA on only 17.8% of tasks using the g>0.1 criterion. Pathway analysis indicates successes arise mainly from methodological translation (recasting tasks as supervised prediction problems) rather than genuine scientific invention, while failures stem primarily from incorrect method choice and insufficient compute rather than task misunderstanding. The benchmark, pipeline, and public leaderboard with maintainer-side reproduction are released.

Significance. If the 90 distilled tasks and NatureGym environments faithfully preserve the original scientific framing, required expertise, and method space of the source papers, the 17.8% result would provide concrete evidence that current coding agents remain limited in moving from reproduction to discovery on real scientific problems. The explicit release of the full benchmark, containerization pipeline, and reproducible leaderboard constitutes a clear strength, enabling direct verification and extension by the community.

major comments (3)

[Abstract] Abstract: The central performance claim (strongest model surpasses SOTA on 17.8% of tasks under g>0.1) and associated failure-mode analysis are stated without any description of how SOTA values were extracted from the source papers, how the g>0.1 threshold was computed, whether error bars or statistical tests were applied, or the precise criteria used to select and distill the 90 tasks. These omissions are load-bearing for the headline empirical result.
[§3 and §4] §3 (NatureBench construction) and §4 (Evaluation): The interpretation that agents succeed via methodological translation rather than invention, and that the low success rate reflects agent limitations rather than benchmark artifacts, rests on the untested assumption that the distillation and containerization steps preserve original problem difficulty and expertise requirements. No validation, sensitivity analysis, or comparison to the source papers' original experimental setups is reported.
[§4] §4 (Results): The pathway analysis and attribution of failures to 'wrong method choice' and 'insufficient compute budget' presuppose that the NatureGym environments allow agents access to the full original method space; if containerization systematically narrows this space or reformulates tasks into ML-friendly forms, the reported failure modes become circular with the benchmark design.
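
The first major comment asks about error bars on the headline rate. As a rough illustration of how much sampling uncertainty a single rate over 90 tasks carries, the sketch below computes a 95% Wilson score interval. It rests on assumptions that are not in the paper: that 17.8% of 90 tasks corresponds to 16 successes (the only integer count that rounds to 17.8%), and that tasks can be treated as independent Bernoulli trials. The function name `wilson_interval` is ours, and the sketch does not capture run-to-run variance across agent seeds.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 17.8% of 90 tasks means 16 tasks (16 / 90 = 0.1777...).
n_tasks = 90
wins = round(0.178 * n_tasks)

low, high = wilson_interval(wins, n_tasks)
print(f"{wins}/{n_tasks} = {wins / n_tasks:.1%}, "
      f"95% Wilson interval: [{low:.1%}, {high:.1%}]")
```

Under these assumptions the interval spans roughly 11% to 27%, which suggests the headline number is better read as "somewhere between one in nine and one in four" than as a sharp point estimate.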

minor comments (2)

[Abstract] The GitHub link is provided but the manuscript does not describe the exact protocol for maintainer-side reproduction or how the public leaderboard will be updated when new agents are evaluated.
[Abstract] Notation for g>0.1 is introduced without an explicit equation or definition in the main text; a short formal definition would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our empirical claims and validation of benchmark construction. We address each major comment below. Where details were omitted, we will revise the manuscript to include them.

Referee: [Abstract] Abstract: The central performance claim (strongest model surpasses SOTA on 17.8% of tasks under g>0.1) and associated failure-mode analysis are stated without any description of how SOTA values were extracted from the source papers, how the g>0.1 threshold was computed, whether error bars or statistical tests were applied, or the precise criteria used to select and distill the 90 tasks. These omissions are load-bearing for the headline empirical result.

Authors: We agree these details are essential. In the revised version we will expand §3 with a dedicated subsection on task selection criteria (relevance to discovery vs. reproduction, feasibility of containerization, diversity across disciplines), SOTA extraction protocol (manual review of results tables and supplementary material in each source paper, selecting the strongest reported metric under comparable conditions), and the g>0.1 definition (normalized gap g = (agent_perf - SOTA)/SOTA > 0.1). We will report error bars from multiple agent seeds where compute permits and note the absence of formal statistical tests due to the heterogeneous task metrics. A concise reference to these procedures will be added to the abstract.

revision: yes

Referee: [§3 and §4] §3 (NatureBench construction) and §4 (Evaluation): The interpretation that agents succeed via methodological translation rather than invention, and that the low success rate reflects agent limitations rather than benchmark artifacts, rests on the untested assumption that the distillation and containerization steps preserve original problem difficulty and expertise requirements. No validation, sensitivity analysis, or comparison to the source papers' original experimental setups is reported.

Authors: We acknowledge the absence of explicit validation. The NatureGym pipeline was designed to retain the original scientific objective and required expertise by extracting the core experimental protocol verbatim into the container; however, we did not previously report a sensitivity check. In revision we will add (i) a table comparing key hyperparameters and evaluation metrics between original papers and NatureGym environments for a random subset of 15 tasks, and (ii) a short sensitivity analysis showing that relaxing container constraints does not materially change the 17.8% figure. This will directly test the preservation assumption.

revision: yes

Referee: [§4] §4 (Results): The pathway analysis and attribution of failures to 'wrong method choice' and 'insufficient compute budget' presuppose that the NatureGym environments allow agents access to the full original method space; if containerization systematically narrows this space or reformulates tasks into ML-friendly forms, the reported failure modes become circular with the benchmark design.

Authors: We maintain that the environments were constructed to expose the full method space described in each source paper, including non-ML baselines where present. Nevertheless, the concern about possible narrowing is valid. In the revision we will insert a new paragraph in §4 that (a) enumerates the method categories explicitly permitted in each environment, (b) provides qualitative evidence that agents could (and sometimes did) invoke non-ML approaches, and (c) discusses the residual risk of implicit ML bias introduced by containerization. This will clarify that the failure-mode attributions are not purely circular.

revision: yes
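
The first response above gives the g>0.1 criterion as a normalized gap, g = (agent_perf - SOTA)/SOTA. A minimal sketch of how a surpass rate could be computed from that definition follows. Everything beyond the formula is an assumption for illustration: the `TaskResult` record, the function names, and the toy scores are hypothetical, and the sketch treats every metric as higher-is-better, which the source does not confirm (an error-style metric would need its sign flipped).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    agent_perf: float  # agent's score on the task metric (assumed higher-is-better)
    sota: float        # strongest comparable score reported in the source paper


def normalized_gap(result: TaskResult) -> float:
    """g = (agent_perf - SOTA) / SOTA, as stated in the simulated rebuttal."""
    if result.sota == 0:
        raise ValueError(f"SOTA is zero for {result.task_id}; g is undefined")
    return (result.agent_perf - result.sota) / result.sota


def surpass_rate(results: list[TaskResult], threshold: float = 0.1) -> float:
    """Fraction of tasks whose normalized gap strictly exceeds the threshold."""
    if not results:
        raise ValueError("no results to score")
    wins = sum(1 for r in results if normalized_gap(r) > threshold)
    return wins / len(results)


# Toy scores, invented for illustration only (not from the paper).
toy = [
    TaskResult("task-a", agent_perf=0.92, sota=0.80),  # g is about 0.15: counts
    TaskResult("task-b", agent_perf=0.84, sota=0.80),  # g is about 0.05: does not count
    TaskResult("task-c", agent_perf=0.60, sota=0.80),  # g is -0.25: does not count
    TaskResult("task-d", agent_perf=1.50, sota=1.00),  # g is 0.50: counts
]
print(surpass_rate(toy))
```

Note that under this reading an agent that merely matches or slightly beats SOTA (task-b) is not counted as surpassing it, so the 17.8% figure would be a count of clear wins rather than of ties.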

Circularity Check

0 steps flagged

No circularity: results derive from external published SOTA comparisons on independently constructed benchmark tasks.

The paper distills 90 tasks from external Nature-family publications and measures agent performance against those papers' independently reported SOTA numbers. The 17.8% success rate under g>0.1 and the method-pathway observations are direct empirical counts from agent executions; no equations, fitted parameters, or self-citations reduce these counts to quantities defined by the authors themselves. The benchmark-construction assumption is a standard external-validity concern rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central performance claim rests on the assumption that Nature-family tasks can be faithfully containerized and that published SOTA numbers are directly comparable to agent outputs; no free parameters are described in the abstract.

axioms (1)

domain assumption: Tasks extracted from Nature papers can be standardized into containerized environments that preserve the original scientific validity and difficulty.
Invoked to justify the NatureGym pipeline and the claim that agents are tested on real scientific problems.

invented entities (2)

NatureBench (no independent evidence)
purpose: Cross-discipline benchmark of 90 tasks for evaluating coding agents on scientific discovery
Newly introduced construct; no independent evidence outside this paper
NatureGym (no independent evidence)
purpose: Automated pipeline that constructs standardized per-task containerized environments from source papers
Newly introduced construct; no independent evidence outside this paper

pith-pipeline@v0.9.1-grok · 5766 in / 1391 out tokens · 21942 ms · 2026-06-26T00:11:13.138233+00:00 · methodology

Reference graph

Works this paper leans on

99 extracted references · 7 canonical work pages

[1]

Accurate structure prediction of biomolecular interactions with alphafold 3

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630 0 (8016): 0 493--500, 2024

2024
[2]

Claude code: An agentic coding tool

Anthropic . Claude code: An agentic coding tool. https://github.com/anthropics/claude-code, 2025

2025
[3]

System card: Claude opus 4.6

Anthropic . System card: Claude opus 4.6. https://www.anthropic.com/claude-opus-4-6-system-card, 2026 a

2026
[4]

System card: Claude opus 4.7

Anthropic . System card: Claude opus 4.7. https://www.anthropic.com/claude-opus-4-7-system-card, 2026 b

2026
[5]

Claude api pricing

Anthropic . Claude api pricing. https://platform.claude.com/docs/en/about-claude/pricing, 2026 c

2026
[6]

OpenScholar : Synthesizing scientific literature with retrieval-augmented LM s, 2024

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. OpenScholar :...

arXiv 2024
[7]

Accurate prediction of protein structures and interactions using a three-track neural network

Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373 0 (6557): 0 871--876, 2021

2021
[8]

Mask-prior-guided denoising diffusion improves inverse protein folding

Peizhen Bai, Filip Miljkovi \'c , Xianyuan Liu, Leonardo De Maria, Rebecca Croasdale-Wood, Owen Rackham, and Haiping Lu. Mask-prior-guided denoising diffusion improves inverse protein folding. Nature Machine Intelligence, 7 0 (6): 0 876--888, 2025

2025
[9]

Atomically accurate de novo design of antibodies with rfdiffusion

Nathaniel R Bennett, Joseph L Watson, Robert J Ragotte, Andrew J Borst, D \'e Jena \'e L See, Connor Weidle, Riti Biswas, Yutong Yu, Ellen L Shrock, Russell Ault, et al. Atomically accurate de novo design of antibodies with rfdiffusion. Nature, 649 0 (8095): 0 183--193, 2026

2026
[10]

A foundation model for the earth system

Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641 0 (8065): 0 1180--1187, 2025

2025
[11]

Autonomous chemical research with large language models

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023

2023
[12]

Genome modelling and design across all domains of life with evo 2

Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with evo 2. Nature, 652 0 (8112): 0 1349--1361, 2026

2026
[13]

AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026

Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026. URL https://arxiv.org/abs/2602.20133

arXiv 2026
[14]

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, volume 2025, pages 50466--50494, 2025

2025
[15]

A generalized-template-based graph neural network for accurate organic reactivity prediction

Shuan Chen and Yousung Jung. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nature Machine Intelligence, 4 0 (9): 0 772--780, 2022. doi:10.1038/s42256-022-00526-z

work page doi:10.1038/s42256-022-00526-z 2022
[16]

Accurate proteome-wide missense variant effect prediction with alphamissense

Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvil \.e Z emgulyt \.e , Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, et al. Accurate proteome-wide missense variant effect prediction with alphamissense. Science, 381 0 (6664): 0 eadg7492, 2023

2023
[17]

Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative optimization, 2026

Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, and Qinhuai Na. Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative ...

Pith/arXiv arXiv 2026
[18]

scgpt: toward building a foundation model for single-cell multi-omics using generative ai

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21 0 (8): 0 1470--1480, 2024

2024
[19]

de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P. de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot. Nucleotide transformer: building and evaluating robust foundation models for ...

work page doi:10.1038/s41592-024-02523-z 2025
[20]

Deepseek v4 preview release

DeepSeek . Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, 2026

2026
[21]

Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih

Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih. Collider -bench: Benchmarking AI agents with particle physics analysis reproduction, 2026. URL https://arxiv.org/abs/2605.13950

Pith/arXiv arXiv 2026
[22]

Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610 0 (7930): 0 47--53, 2022 a

2022
[23]

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610: 0 47--53, 2022 b . doi:...

work page doi:10.1038/s41586-022-05172-4 2022
[24]

MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025

Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, and Yuzhuo Fu. MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025. URL https://arxiv.org/abs/2508.14146

arXiv 2025
[25]

A multi-agent system for automating scientific discovery

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, et al. A multi-agent system for automating scientific discovery. Nature, 2026. doi:10.1038/s41586-026-10652-y. URL https://www.nature.com/articles/s41586-026-10652-y

work page doi:10.1038/s41586-026-10652-y 2026
[26]

Gemini cli: An open-source ai agent

Google . Gemini cli: An open-source ai agent. https://github.com/google-gemini/gemini-cli, 2025

2025
[27]

Gemini 3.5 flash model card

Google DeepMind . Gemini 3.5 flash model card. https://deepmind.google/models/model-cards/gemini-3-5-flash/, 2026

2026
[28]

Accelerating scientific discovery with co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Petar Sirkovic, Artiom Myaskovsky, Grzegorz Glowaty, Felix Weissenberger, Alessio Orlandi, Dan Popovici, et al. Accelerating scientific discovery with co-scientist. Nature, pages 1--3, 2026 a

2026
[29]

Accelerating scientific discovery with Co-Scientist

Juraj Gottweis et al. Accelerating scientific discovery with Co-Scientist . Nature, 2026 b . doi:10.1038/s41586-026-10644-y. URL https://www.nature.com/articles/s41586-026-10644-y

work page doi:10.1038/s41586-026-10644-y 2026
[30]

Artificial intelligence tools expand scientists’ impact but contract science’s focus

Qianyue Hao, Fengli Xu, Yong Li, and James Evans. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature, pages 1--7, 2026

2026
[31]

Closed-form continuous-time neural networks

Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, and Daniela Rus. Closed-form continuous-time neural networks. Nature Machine Intelligence, 4 0 (11): 0 992--1003, 2022

2022
[32]

REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025

Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025. URL https://arxiv.org/abs/2507.18901

arXiv 2025
[33]

MLAgentBench : Evaluating language agents on machine learning experimentation, 2023

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench : Evaluating language agents on machine learning experimentation, 2023. URL https://arxiv.org/abs/2310.03302

arXiv 2023
[34]

Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, and Daniel Khashabi. Can coding agents reproduce findings in computational materials science?, 2026. URL h...

Pith/arXiv arXiv 2026
[35]

Olympiad-level formal mathematical reasoning with reinforcement learning

Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pages 1--3, 2025

2025
[36]

Equivariant 3d-conditional diffusion model for molecular linker design

Ilia Igashov, Hannes St \"a rk, Cl \'e ment Vignac, Arne Schneuing, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, and Bruno Correia. Equivariant 3d-conditional diffusion model for molecular linker design. Nature Machine Intelligence, 6 0 (4): 0 417--427, 2024

2024
[37]

ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=JCjGvbsOmQ

2025
[38]

DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024

Peter Jansen, Marc-Alexandre Cote, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024. URL https://arxiv.org/abs/2406.06769

arXiv 2024
[39]

DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026

Jiachen Jiang, Tianyu Ding, and Zhihui Zhu. DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026. URL https://arxiv.org/abs/2602.02919

arXiv 2026
[40]

Highly accurate protein structure prediction with alphafold

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Z \' dek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. nature, 596 0 (7873): 0 583--589, 2021

2021
[41]

autoresearch

Andrej Karpathy. autoresearch . https://github.com/karpathy/autoresearch, 2026

2026
[42]

From reproduction to replication: Evaluating research agents with progressive code masking, 2025

Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, and Daniel Fried. From reproduction to replication: Evaluating research agents with progressive code masking, 2025. URL https://arxiv.org/abs/2506.19724

arXiv 2025
[43]

Cell2location maps fine-grained cell types in spatial transcriptomics

Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W King, Tong Li, Rasa Elmentaite, Artem Lomakin, Veronika Kedlian, Adam Gayoso, et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology, 40 0 (5): 0 661--671, 2022

2022
[44]

Rodriques, and Andrew D

Jakub Lala, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. PaperQA : Retrieval-augmented generative agent for scientific research, 2023. URL https://arxiv.org/abs/2312.07559

arXiv 2023
[45]

Learning skillful medium-range global weather forecasting

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382 0 (6677): 0 1416--1421, 2023

2023
[46]

Laurent, Joseph D

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB -bench: Measuring capabilities of language models for biology research, 2024. URL https://arxiv.org/abs/2407.10362

Pith/arXiv arXiv 2024
[47]

AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, and Tie-Yan Liu. AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026. URL https://arxiv.org/abs/2604.05550

Pith/arXiv arXiv 2026
[48]

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, and James Zou. Can large language models provide useful feedback on research papers? a large-scale empirical analysis, 2023. URL https://arxiv.org/abs/2310.01783

[49]

Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, and Jian Pei. Position: Agentic evolution is the path to evolving LLMs, 2026. URL https://arxiv.org/abs/2602.00359

[50]

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023

[51]

Gang Liu, Yihan Zhu, et al. Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025. URL https://arxiv.org/abs/2510.06056

[52]

Ryan Liu and Nihar B. Shah. ReviewerGPT? an exploratory study on using large language models for paper reviewing, 2023. URL https://arxiv.org/abs/2306.00622

[53]

Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026

[54]

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

[55]

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of AI research. Nature, 651(8107):914–919, 2026

[56]

AIRS-bench: a suite of tasks for frontier AI research science agents, 2026

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Cha...

[57]

Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi Jin

Bohan Lyu, Yucheng Yang, Siqiao Huang, Jiaru Zhang, Qixin Xu, Xinghan Li, Xinyang Han, Yicheng Zhang, Huaqing Zhang, Runhan Huang, Kaicheng Yang, Zitao Chen, Wentao Guo, Junlin Yang, Xinyue Ai, Wenhao Chai, Yadi Cao, Ziran Yang, Kun Wang, Dapeng Jiang, Huan-ang Gao, Shange Tang, Chengshuai Shi, Simon S. Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi...

[58]

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. EvoScientist: Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026b. URL https://arxiv.org/abs/2603.08127

[59]

Gonzalez, Jingbo Shang, and Alvin Cheung

Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, ...

[60]

Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990):80–85, 2023

[61]

Jishuai Miao, Jinzhao Li, Jingxue Xin, Jiajuan Tu, Muyang Ge, Ji Qi, Xiaocheng Zhou, Ying Zhu, Can Yang, and Zhixiang Lin. Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning. Nature Communications, 16(1):9403, 2025

[62]

MiniMax. Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, 2026

[63]

Moonshot AI. Kimi k2.6. https://www.kimi.com/ai-models/kimi-k2-6, 2026

[64]

MLGym: A new framework and benchmark for advancing AI research agents, 2025

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym: A new framework and benchmark for advancing AI rese...

[65]

Alexander Novikov, Ngan Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

[66]

OpenAI. Codex CLI: Lightweight coding agent that runs in your terminal. https://github.com/openai/codex, 2025

[67]

OpenAI. GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026a

[68]

OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card/, 2026b

[69]

OpenAI. What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them, 2026c

[70]

Jens Oppliger, M Michael Denner, Julia Küspert, Ruggero Frison, Qisi Wang, Alexander Morawietz, Oleh Ivashko, Ann-Christin Dippel, Martin von Zimmermann, Izabela Biało, et al. Weak signal extraction enabled by deep neural network denoising of diffraction data. Nature Machine Intelligence, 6(2):180–186, 2024

[71]

Jinghua Piao, Jiazhen Liu, Fang Zhang, Jun Su, and Yong Li. Human–AI adaptive dynamics drives the emergence of information cocoons. Nature Machine Intelligence, 5(11):1214–1224, 2023

[72]

Jesús Pineda, Sergi Masó-Orriols, Montse Masoliver, Joan Bertran, Mattias Goksör, Giovanni Volpe, and Carlo Manzo. Enhanced spatial clustering of single-molecule localizations with graph neural networks. Nature Communications, 16(1):9693, 2025

[73]

Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Probabilistic weather forecasting with machine learning. Nature, 637(8044):84–90, 2025

[74]

Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar VK, Rongzhi Zhang, Changhao Li, Ian Wong, Sherry Yang, Percy Liang, Chao Zhang, et al. MLE-Dojo: Interactive environments for empowering LLM agents in machine learning engineering. Advances in Neural Information Processing Systems, 38, 2026

[75]

Qwen Team. Qwen3.7: The agent frontier. https://qwen.ai/blog?id=qwen3.7, 2026

[76]

Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. PostTrainBench: Can LLM agents automate LLM post-training?, 2026. URL https://arxiv.org/abs/2603.08640

[77]

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2024. doi:10.1038/s41586-023-06924-6...

[78]

Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark, 2024. URL https://arxiv.org/abs/2409.11363

[79]

Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740

[80]

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI's ability to replicate AI research, 2025. URL https://arxiv.org/abs/2504.01848

[1] [1]

Accurate structure prediction of biomolecular interactions with alphafold 3

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630 0 (8016): 0 493--500, 2024

2024

[2] [2]

Claude code: An agentic coding tool

Anthropic . Claude code: An agentic coding tool. https://github.com/anthropics/claude-code, 2025

2025

[3] [3]

System card: Claude opus 4.6

Anthropic . System card: Claude opus 4.6. https://www.anthropic.com/claude-opus-4-6-system-card, 2026 a

2026

[4] [4]

System card: Claude opus 4.7

Anthropic . System card: Claude opus 4.7. https://www.anthropic.com/claude-opus-4-7-system-card, 2026 b

2026

[5] [5]

Claude api pricing

Anthropic . Claude api pricing. https://platform.claude.com/docs/en/about-claude/pricing, 2026 c

2026

[6] [6]

OpenScholar : Synthesizing scientific literature with retrieval-augmented LM s, 2024

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. OpenScholar :...

arXiv 2024

[7] [7]

Accurate prediction of protein structures and interactions using a three-track neural network

Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373 0 (6557): 0 871--876, 2021

2021

[8] [8]

Mask-prior-guided denoising diffusion improves inverse protein folding

Peizhen Bai, Filip Miljkovi \'c , Xianyuan Liu, Leonardo De Maria, Rebecca Croasdale-Wood, Owen Rackham, and Haiping Lu. Mask-prior-guided denoising diffusion improves inverse protein folding. Nature Machine Intelligence, 7 0 (6): 0 876--888, 2025

2025

[9] [9]

Atomically accurate de novo design of antibodies with rfdiffusion

Nathaniel R Bennett, Joseph L Watson, Robert J Ragotte, Andrew J Borst, D \'e Jena \'e L See, Connor Weidle, Riti Biswas, Yutong Yu, Ellen L Shrock, Russell Ault, et al. Atomically accurate de novo design of antibodies with rfdiffusion. Nature, 649 0 (8095): 0 183--193, 2026

2026

[10] [10]

A foundation model for the earth system

Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641 0 (8065): 0 1180--1187, 2025

2025

[11] [11]

Autonomous chemical research with large language models

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023

2023

[12] [12]

Genome modelling and design across all domains of life with evo 2

Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with evo 2. Nature, 652 0 (8112): 0 1349--1361, 2026

2026

[13] [13]

AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026

Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026. URL https://arxiv.org/abs/2602.20133

arXiv 2026

[14] [14]

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, volume 2025, pages 50466--50494, 2025

2025

[15] [15]

A generalized-template-based graph neural network for accurate organic reactivity prediction

Shuan Chen and Yousung Jung. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nature Machine Intelligence, 4 0 (9): 0 772--780, 2022. doi:10.1038/s42256-022-00526-z

work page doi:10.1038/s42256-022-00526-z 2022

[16] [16]

Accurate proteome-wide missense variant effect prediction with alphamissense

Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvil \.e Z emgulyt \.e , Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, et al. Accurate proteome-wide missense variant effect prediction with alphamissense. Science, 381 0 (6664): 0 eadg7492, 2023

2023

[17] [17]

Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative optimization, 2026

Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, and Qinhuai Na. Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative ...

Pith/arXiv arXiv 2026

[18] [18]

scgpt: toward building a foundation model for single-cell multi-omics using generative ai

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21 0 (8): 0 1470--1480, 2024

2024

[19] [19]

de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P. de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot. Nucleotide transformer: building and evaluating robust foundation models for ...

work page doi:10.1038/s41592-024-02523-z 2025

[20] [20]

Deepseek v4 preview release

DeepSeek . Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, 2026

2026

[21] [21]

Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih

Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih. Collider -bench: Benchmarking AI agents with particle physics analysis reproduction, 2026. URL https://arxiv.org/abs/2605.13950

Pith/arXiv arXiv 2026

[22] [22]

Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610 0 (7930): 0 47--53, 2022 a

2022

[23] [23]

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610: 0 47--53, 2022 b . doi:...

work page doi:10.1038/s41586-022-05172-4 2022

[24] [24]

MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025

Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, and Yuzhuo Fu. MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025. URL https://arxiv.org/abs/2508.14146

arXiv 2025

[25] [25]

A multi-agent system for automating scientific discovery

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, et al. A multi-agent system for automating scientific discovery. Nature, 2026. doi:10.1038/s41586-026-10652-y. URL https://www.nature.com/articles/s41586-026-10652-y

work page doi:10.1038/s41586-026-10652-y 2026

[26] [26]

Gemini cli: An open-source ai agent

Google . Gemini cli: An open-source ai agent. https://github.com/google-gemini/gemini-cli, 2025

2025

[27] [27]

Gemini 3.5 flash model card

Google DeepMind . Gemini 3.5 flash model card. https://deepmind.google/models/model-cards/gemini-3-5-flash/, 2026

2026

[28] [28]

Accelerating scientific discovery with co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Petar Sirkovic, Artiom Myaskovsky, Grzegorz Glowaty, Felix Weissenberger, Alessio Orlandi, Dan Popovici, et al. Accelerating scientific discovery with co-scientist. Nature, pages 1--3, 2026 a

2026

[29] [29]

Accelerating scientific discovery with Co-Scientist

Juraj Gottweis et al. Accelerating scientific discovery with Co-Scientist . Nature, 2026 b . doi:10.1038/s41586-026-10644-y. URL https://www.nature.com/articles/s41586-026-10644-y

work page doi:10.1038/s41586-026-10644-y 2026

[30] [30]

Artificial intelligence tools expand scientists’ impact but contract science’s focus

Qianyue Hao, Fengli Xu, Yong Li, and James Evans. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature, pages 1--7, 2026

2026

[31] [31]

Closed-form continuous-time neural networks

Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, and Daniela Rus. Closed-form continuous-time neural networks. Nature Machine Intelligence, 4 0 (11): 0 992--1003, 2022

2022

[32] [32]

REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025

Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025. URL https://arxiv.org/abs/2507.18901

arXiv 2025

[33] [33]

MLAgentBench : Evaluating language agents on machine learning experimentation, 2023

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench : Evaluating language agents on machine learning experimentation, 2023. URL https://arxiv.org/abs/2310.03302

arXiv 2023

[34] [34]

Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, and Daniel Khashabi. Can coding agents reproduce findings in computational materials science?, 2026. URL h...

Pith/arXiv arXiv 2026

[35] [35]

Olympiad-level formal mathematical reasoning with reinforcement learning

Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pages 1--3, 2025

2025

[36] [36]

Equivariant 3d-conditional diffusion model for molecular linker design

Ilia Igashov, Hannes St \"a rk, Cl \'e ment Vignac, Arne Schneuing, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, and Bruno Correia. Equivariant 3d-conditional diffusion model for molecular linker design. Nature Machine Intelligence, 6 0 (4): 0 417--427, 2024

2024

[37] [37]

ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=JCjGvbsOmQ

2025

[38] [38]

DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024

Peter Jansen, Marc-Alexandre Cote, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024. URL https://arxiv.org/abs/2406.06769

arXiv 2024

[39] [39]

DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026

Jiachen Jiang, Tianyu Ding, and Zhihui Zhu. DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026. URL https://arxiv.org/abs/2602.02919

arXiv 2026

[40] [40]

Highly accurate protein structure prediction with alphafold

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Z \' dek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. nature, 596 0 (7873): 0 583--589, 2021

2021

[41] [41]

autoresearch

Andrej Karpathy. autoresearch . https://github.com/karpathy/autoresearch, 2026

2026

[42] [42]

From reproduction to replication: Evaluating research agents with progressive code masking, 2025

Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, and Daniel Fried. From reproduction to replication: Evaluating research agents with progressive code masking, 2025. URL https://arxiv.org/abs/2506.19724

arXiv 2025

[43] [43]

Cell2location maps fine-grained cell types in spatial transcriptomics

Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W King, Tong Li, Rasa Elmentaite, Artem Lomakin, Veronika Kedlian, Adam Gayoso, et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology, 40 0 (5): 0 661--671, 2022

2022

[44] [44]

Rodriques, and Andrew D

Jakub Lala, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. PaperQA : Retrieval-augmented generative agent for scientific research, 2023. URL https://arxiv.org/abs/2312.07559

arXiv 2023

[45] [45]

Learning skillful medium-range global weather forecasting

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382 0 (6677): 0 1416--1421, 2023

2023

[46] [46]

Laurent, Joseph D

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB -bench: Measuring capabilities of language models for biology research, 2024. URL https://arxiv.org/abs/2407.10362

Pith/arXiv arXiv 2024

[47] [47]

AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, and Tie-Yan Liu. AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026. URL https://arxiv.org/abs/2604.05550

Pith/arXiv arXiv 2026

[48] [48]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis, 2023

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, and James Zou. Can large language models provide useful feedback on research papers? a large-scale empirical analysis, 2023. URL https://arxiv.org/abs/2310.01783

arXiv 2023

[49] [49]

Position: Agentic evolution is the path to evolving LLM s, 2026

Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, and Jian Pei. Position: Agentic evolution is the path to evolving LLM s, 2026. URL https://arxiv.org/abs/2602.00359

arXiv 2026

[50] [50]

Evolutionary-scale prediction of atomic-level protein structure with a language model

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 0 (6637): 0 1123--1130, 2023

2023

[51] [51]

Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025

Gang Liu, Yihan Zhu, et al. Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025. URL https://arxiv.org/abs/2510.06056

arXiv 2025

[52] [52]

Ryan Liu and Nihar B. Shah. ReviewerGPT ? an exploratory study on using large language models for paper reviewing, 2023. URL https://arxiv.org/abs/2306.00622

arXiv 2023

[53] [53]

Evox: Meta-evolution for automated discovery

Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026

arXiv 2026

[54] [54]

The AI scientist: Towards fully automated open-ended scientific discovery, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

Pith/arXiv arXiv 2024

[55] [55]

Towards end-to-end automation of ai research

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research. Nature, 651 0 (8107): 0 914--919, 2026

2026

[56] [56]

AIRS -bench: a suite of tasks for frontier AI research science agents, 2026

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Cha...

arXiv 2026

[57] [57]

Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi Jin

Bohan Lyu, Yucheng Yang, Siqiao Huang, Jiaru Zhang, Qixin Xu, Xinghan Li, Xinyang Han, Yicheng Zhang, Huaqing Zhang, Runhan Huang, Kaicheng Yang, Zitao Chen, Wentao Guo, Junlin Yang, Xinyue Ai, Wenhao Chai, Yadi Cao, Ziran Yang, Kun Wang, Dapeng Jiang, Huan-ang Gao, Shange Tang, Chengshuai Shi, Simon S. Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi...

Pith/arXiv arXiv 2026

[58] [58]

EvoScientist : Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026 b

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. EvoScientist : Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026 b . URL https://arxiv.org/abs/2603.08127

arXiv 2026

[59] [59]

Gonzalez, Jingbo Shang, and Alvin Cheung

Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, ...

arXiv 2025

[60] [60]

Scaling deep learning for materials discovery

Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624 0 (7990): 0 80--85, 2023

2023

[61] [61]

Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning

Jishuai Miao, Jinzhao Li, Jingxue Xin, Jiajuan Tu, Muyang Ge, Ji Qi, Xiaocheng Zhou, Ying Zhu, Can Yang, and Zhixiang Lin. Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning. Nature Communications, 16 0 (1): 0 9403, 2025

2025

[62] [62]

Minimax m2.7: Early echoes of self-evolution

MiniMax . Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, 2026

2026

[63] [63]

Kimi k2.6

Moonshot AI . Kimi k2.6. https://www.kimi.com/ai-models/kimi-k2-6, 2026

2026

[64] [64]

MLGym : A new framework and benchmark for advancing AI research agents, 2025

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym : A new framework and benchmark for advancing AI rese...

arXiv 2025

[65] [65]

Alexander Novikov, Ngan Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve : A coding agent for scientific and algor...

Pith/arXiv arXiv 2025

[66] [66]

Codex cli: Lightweight coding agent that runs in your terminal

OpenAI . Codex cli: Lightweight coding agent that runs in your terminal. https://github.com/openai/codex, 2025

2025

[67] [67]

Gpt-5.4 thinking system card

OpenAI . Gpt-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026 a

2026

[68] [68]

Gpt-5.5 system card

OpenAI . Gpt-5.5 system card. https://openai.com/index/gpt-5-5-system-card/, 2026 b

2026

[69] [69]

What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them, 2026 c

OpenAI . What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them, 2026 c

arXiv 2026

[70] [70]

Weak signal extraction enabled by deep neural network denoising of diffraction data

Jens Oppliger, M Michael Denner, Julia Küspert, Ruggero Frison, Qisi Wang, Alexander Morawietz, Oleh Ivashko, Ann-Christin Dippel, Martin von Zimmermann, Izabela Biało, et al. Weak signal extraction enabled by deep neural network denoising of diffraction data. Nature Machine Intelligence, 6(2): 180–186, 2024

2024

[71] [71]

Human–AI adaptive dynamics drives the emergence of information cocoons

Jinghua Piao, Jiazhen Liu, Fang Zhang, Jun Su, and Yong Li. Human–AI adaptive dynamics drives the emergence of information cocoons. Nature Machine Intelligence, 5(11): 1214–1224, 2023

2023

[72] [72]

Enhanced spatial clustering of single-molecule localizations with graph neural networks

Jesús Pineda, Sergi Masó-Orriols, Montse Masoliver, Joan Bertran, Mattias Goksör, Giovanni Volpe, and Carlo Manzo. Enhanced spatial clustering of single-molecule localizations with graph neural networks. Nature Communications, 16(1): 9693, 2025

2025

[73] [73]

Probabilistic weather forecasting with machine learning

Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Probabilistic weather forecasting with machine learning. Nature, 637(8044): 84–90, 2025

2025

[74] [74]

Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering

Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar VK, Rongzhi Zhang, Changhao Li, Ian Wong, Sherry Yang, Percy Liang, Chao Zhang, et al. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering. Advances in Neural Information Processing Systems, 38, 2026

2026

[75] [75]

Qwen3.7: The agent frontier

Qwen Team. Qwen3.7: The agent frontier. https://qwen.ai/blog?id=qwen3.7, 2026

2026

[76] [76]

PostTrainBench: Can LLM agents automate LLM post-training?, 2026

Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. PostTrainBench: Can LLM agents automate LLM post-training?, 2026. URL https://arxiv.org/abs/2603.08640

arXiv 2026

[77] [77]

Mathematical discoveries from program search with large language models

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625: 468–475, 2024. doi:10.1038/s41586-023-06924-6...

work page doi:10.1038/s41586-023-06924-6 2024

[78] [78]

CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark

Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark, 2024. URL https://arxiv.org/abs/2409.11363

Pith/arXiv arXiv 2024

[79] [79]

Language agents achieve superhuman synthesis of scientific knowledge

Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740

arXiv 2024

[80] [80]

PaperBench: Evaluating AI's ability to replicate AI research, 2025

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI's ability to replicate AI research, 2025. URL https://arxiv.org/abs/2504.01848

Pith/arXiv arXiv 2025