NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
Pith reviewed 2026-06-26 00:11 UTC · model grok-4.3
The pith
AI coding agents surpass published SOTA on only 17.8 percent of tasks from Nature-family papers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding.
What carries the argument
NatureBench, a set of 90 tasks distilled from Nature-family papers, together with the NatureGym pipeline that builds standardized containerized environments for each task.
If this is right
- Agents that succeed do so by converting the original scientific question into a supervised prediction problem they already know how to solve.
- Wrong method choice and insufficient compute budget account for most failures rather than inability to parse the task statement.
- Standardized containerization removes environment-fragmentation barriers that previously made agent-on-research benchmarks hard to trust.
- The benchmark supplies a public leaderboard and maintainer-side reproduction protocol for tracking future progress.
Where Pith is reading between the lines
- If the translation pattern persists, scaling compute alone will not close the gap to scientific invention.
- The same evaluation pipeline could be applied to papers outside the Nature family to test whether the 17.8 percent ceiling is domain-specific.
- A follow-up study could measure whether agents improve when given explicit incentives or tools for proposing novel experimental designs rather than defaulting to familiar models.
Load-bearing premise
The 90 tasks distilled from Nature-family papers, together with the NatureGym containerization process, faithfully represent the original scientific discovery problems and experimental setups without introducing systematic biases in difficulty or required expertise.
What would settle it
A controlled experiment in which an agent configuration, under the identical web-search-disabled protocol, surpasses the published SOTA on more than half the tasks while using methods that are not simple translations of existing supervised-learning pipelines.
Original abstract
We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NatureBench, a cross-disciplinary benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, constructed via the NatureGym automated pipeline that generates standardized containerized environments per task. Under a web-search-disabled protocol, ten frontier agent configurations are evaluated; the strongest model surpasses published SOTA on only 17.8% of tasks using the g>0.1 criterion. Pathway analysis indicates successes arise mainly from methodological translation (recasting tasks as supervised prediction problems) rather than genuine scientific invention, while failures stem primarily from incorrect method choice and insufficient compute rather than task misunderstanding. The benchmark, pipeline, and public leaderboard with maintainer-side reproduction are released.
Significance. If the 90 distilled tasks and NatureGym environments faithfully preserve the original scientific framing, required expertise, and method space of the source papers, the 17.8% result would provide concrete evidence that current coding agents remain limited in moving from reproduction to discovery on real scientific problems. The explicit release of the full benchmark, containerization pipeline, and reproducible leaderboard constitutes a clear strength, enabling direct verification and extension by the community.
major comments (3)
- [Abstract] The central performance claim (strongest model surpasses SOTA on 17.8% of tasks under g>0.1) and associated failure-mode analysis are stated without any description of how SOTA values were extracted from the source papers, how the g>0.1 threshold was computed, whether error bars or statistical tests were applied, or the precise criteria used to select and distill the 90 tasks. These omissions are load-bearing for the headline empirical result.
- [§3 and §4] §3 (NatureBench construction) and §4 (Evaluation): The interpretation that agents succeed via methodological translation rather than invention, and that the low success rate reflects agent limitations rather than benchmark artifacts, rests on the untested assumption that the distillation and containerization steps preserve original problem difficulty and expertise requirements. No validation, sensitivity analysis, or comparison to the source papers' original experimental setups is reported.
- [§4] §4 (Results): The pathway analysis and attribution of failures to 'wrong method choice' and 'insufficient compute budget' presuppose that the NatureGym environments allow agents access to the full original method space; if containerization systematically narrows this space or reformulates tasks into ML-friendly forms, the reported failure modes become circular with the benchmark design.
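The first major comment asks whether error bars accompany the headline rate. As a rough illustration of what such a bar could look like, the sketch below computes a Wilson score interval for a binomial proportion. The count of 16 successes is an inference made here (16 of 90 is the only integer count that rounds to 17.8%), not a figure stated in the abstract, and treating tasks as independent trials is a simplifying assumption. The function name `wilson_interval` is illustrative and does not come from the paper or its code.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumption: 16 of 90 tasks, inferred from the reported 17.8%.
low, high = wilson_interval(16, 90)
print(f"surpass rate, approximate 95% interval: {low:.3f} to {high:.3f}")
```

With only 90 tasks the interval is wide, which is one reason the referee's request for uncertainty estimates bears directly on the headline number.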
minor comments (2)
- [Abstract] The GitHub link is provided but the manuscript does not describe the exact protocol for maintainer-side reproduction or how the public leaderboard will be updated when new agents are evaluated.
- [Abstract] Notation for g>0.1 is introduced without an explicit equation or definition in the main text; a short formal definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in our empirical claims and validation of benchmark construction. We address each major comment below. Where details were omitted, we will revise the manuscript to include them.
- Referee: [Abstract] The central performance claim (strongest model surpasses SOTA on 17.8% of tasks under g>0.1) and associated failure-mode analysis are stated without any description of how SOTA values were extracted from the source papers, how the g>0.1 threshold was computed, whether error bars or statistical tests were applied, or the precise criteria used to select and distill the 90 tasks. These omissions are load-bearing for the headline empirical result.
  Authors: We agree these details are essential. In the revised version we will expand §3 with a dedicated subsection on task selection criteria (relevance to discovery vs. reproduction, feasibility of containerization, diversity across disciplines), SOTA extraction protocol (manual review of results tables and supplementary material in each source paper, selecting the strongest reported metric under comparable conditions), and the g>0.1 definition (normalized gap g = (agent_perf - SOTA)/SOTA > 0.1). We will report error bars from multiple agent seeds where compute permits and note the absence of formal statistical tests due to the heterogeneous task metrics. A concise reference to these procedures will be added to the abstract.
  Revision: yes
- Referee: [§3 and §4] §3 (NatureBench construction) and §4 (Evaluation): The interpretation that agents succeed via methodological translation rather than invention, and that the low success rate reflects agent limitations rather than benchmark artifacts, rests on the untested assumption that the distillation and containerization steps preserve original problem difficulty and expertise requirements. No validation, sensitivity analysis, or comparison to the source papers' original experimental setups is reported.
  Authors: We acknowledge the absence of explicit validation. The NatureGym pipeline was designed to retain the original scientific objective and required expertise by extracting the core experimental protocol verbatim into the container; however, we did not previously report a sensitivity check. In revision we will add (i) a table comparing key hyperparameters and evaluation metrics between original papers and NatureGym environments for a random subset of 15 tasks, and (ii) a short sensitivity analysis showing that relaxing container constraints does not materially change the 17.8% figure. This will directly test the preservation assumption.
  Revision: yes
- Referee: [§4] §4 (Results): The pathway analysis and attribution of failures to 'wrong method choice' and 'insufficient compute budget' presuppose that the NatureGym environments allow agents access to the full original method space; if containerization systematically narrows this space or reformulates tasks into ML-friendly forms, the reported failure modes become circular with the benchmark design.
  Authors: We maintain that the environments were constructed to expose the full method space described in each source paper, including non-ML baselines where present. Nevertheless, the concern about possible narrowing is valid. In the revision we will insert a new paragraph in §4 that (a) enumerates the method categories explicitly permitted in each environment, (b) provides qualitative evidence that agents could (and sometimes did) invoke non-ML approaches, and (c) discusses the residual risk of implicit ML bias introduced by containerization. This will clarify that the failure-mode attributions are not purely circular.
  Revision: yes
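The normalized-gap criterion given in the authors' first response, g = (agent_perf - SOTA)/SOTA > 0.1, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: the `TaskResult` structure, the `higher_is_better` flag, the sign flip for error-type metrics, and the use of |SOTA| in the denominator are all hypothetical additions, since the rebuttal's formula only covers the higher-is-better case with a positive SOTA value.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent run on one task (illustrative structure, not from the paper)."""
    task_id: str
    agent_perf: float
    sota: float
    higher_is_better: bool = True  # assumption: not part of the rebuttal's formula


def normalized_gap(r: TaskResult) -> float:
    """g = (agent_perf - SOTA) / |SOTA|, sign-flipped for lower-is-better metrics."""
    if r.sota == 0:
        raise ValueError(f"SOTA of zero makes g undefined for task {r.task_id}")
    g = (r.agent_perf - r.sota) / abs(r.sota)  # abs() is an assumption for negative metrics
    return g if r.higher_is_better else -g


def surpass_rate(results, threshold: float = 0.1) -> float:
    """Fraction of tasks whose normalized gap strictly exceeds the threshold."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if normalized_gap(r) > threshold)
    return wins / len(results)


# Made-up numbers purely for illustration.
demo = [
    TaskResult("accuracy-task", agent_perf=0.92, sota=0.80),
    TaskResult("near-miss-task", agent_perf=0.81, sota=0.80),
    TaskResult("error-task", agent_perf=0.40, sota=0.50, higher_is_better=False),
]
print(f"surpass rate at g>0.1: {surpass_rate(demo):.3f}")
```

Varying `threshold` in `surpass_rate` gives a simple way to check how sensitive a reported rate is to the 0.1 cutoff, which relates to the referee's question about how the threshold was chosen.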
Circularity Check
No circularity: results derive from external published SOTA comparisons on independently constructed benchmark tasks.
The paper distills 90 tasks from external Nature-family publications and measures agent performance against those papers' independently reported SOTA numbers. The 17.8% success rate under g>0.1 and the method-pathway observations are direct empirical counts from agent executions; no equations, fitted parameters, or self-citations reduce these counts to quantities defined by the authors themselves. The benchmark-construction assumption is a standard external-validity concern rather than a self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: tasks extracted from Nature-family papers can be standardized into containerized environments that preserve the original scientific validity and difficulty.
invented entities (2)
- NatureBench: no independent evidence
- NatureGym: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Accurate structure prediction of biomolecular interactions with alphafold 3
Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 630 0 (8016): 0 493--500, 2024
2024
-
[2]
Claude code: An agentic coding tool
Anthropic . Claude code: An agentic coding tool. https://github.com/anthropics/claude-code, 2025
2025
-
[3]
System card: Claude opus 4.6
Anthropic . System card: Claude opus 4.6. https://www.anthropic.com/claude-opus-4-6-system-card, 2026 a
2026
-
[4]
System card: Claude opus 4.7
Anthropic . System card: Claude opus 4.7. https://www.anthropic.com/claude-opus-4-7-system-card, 2026 b
2026
-
[5]
Claude api pricing
Anthropic . Claude api pricing. https://platform.claude.com/docs/en/about-claude/pricing, 2026 c
2026
-
[6]
OpenScholar : Synthesizing scientific literature with retrieval-augmented LM s, 2024
Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen tau Yih, Pang Wei Koh, and Hannaneh Hajishirzi. OpenScholar :...
arXiv 2024
-
[7]
Accurate prediction of protein structures and interactions using a three-track neural network
Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373 0 (6557): 0 871--876, 2021
2021
-
[8]
Mask-prior-guided denoising diffusion improves inverse protein folding
Peizhen Bai, Filip Miljkovi \'c , Xianyuan Liu, Leonardo De Maria, Rebecca Croasdale-Wood, Owen Rackham, and Haiping Lu. Mask-prior-guided denoising diffusion improves inverse protein folding. Nature Machine Intelligence, 7 0 (6): 0 876--888, 2025
2025
-
[9]
Atomically accurate de novo design of antibodies with rfdiffusion
Nathaniel R Bennett, Joseph L Watson, Robert J Ragotte, Andrew J Borst, D \'e Jena \'e L See, Connor Weidle, Riti Biswas, Yutong Yu, Ellen L Shrock, Russell Ault, et al. Atomically accurate de novo design of antibodies with rfdiffusion. Nature, 649 0 (8095): 0 183--193, 2026
2026
-
[10]
A foundation model for the earth system
Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641 0 (8065): 0 1180--1187, 2025
2025
-
[11]
Autonomous chemical research with large language models
Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023
2023
-
[12]
Genome modelling and design across all domains of life with evo 2
Garyk Brixi, Matthew G Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A Gonzalez, et al. Genome modelling and design across all domains of life with evo 2. Nature, 652 0 (8112): 0 1349--1361, 2026
2026
-
[13]
AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026
Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve : Adaptive LLM driven zeroth-order optimization, 2026. URL https://arxiv.org/abs/2602.20133
arXiv 2026
-
[14]
Mle-bench: Evaluating machine learning agents on machine learning engineering
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, volume 2025, pages 50466--50494, 2025
2025
-
[15]
A generalized-template-based graph neural network for accurate organic reactivity prediction
Shuan Chen and Yousung Jung. A generalized-template-based graph neural network for accurate organic reactivity prediction. Nature Machine Intelligence, 4 0 (9): 0 772--780, 2022. doi:10.1038/s42256-022-00526-z
-
[16]
Accurate proteome-wide missense variant effect prediction with alphamissense
Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvil \.e Z emgulyt \.e , Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, et al. Accurate proteome-wide missense variant effect prediction with alphamissense. Science, 381 0 (6664): 0 eadg7492, 2023
2023
-
[17]
Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, and Qinhuai Na. Frontier-Eng : Benchmarking self-evolving agents on real-world engineering tasks with generative ...
Pith/arXiv arXiv 2026
-
[18]
scgpt: toward building a foundation model for single-cell multi-omics using generative ai
Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature methods, 21 0 (8): 0 1470--1480, 2024
2024
-
[19]
Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P. de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, and Thomas Pierrot. Nucleotide transformer: building and evaluating robust foundation models for ...
-
[20]
Deepseek v4 preview release
DeepSeek . Deepseek v4 preview release. https://api-docs.deepseek.com/news/news260424, 2026
2026
-
[21]
Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih
Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, and David Shih. Collider -bench: Benchmarking AI agents with particle physics analysis reproduction, 2026. URL https://arxiv.org/abs/2605.13950
Pith/arXiv arXiv 2026
-
[22]
Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610 0 (7930): 0 47--53, 2022 a
2022
-
[23]
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610: 0 47--53, 2022 b . doi:...
-
[24]
MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025
Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, and Yuzhuo Fu. MMReview : A multidisciplinary and multimodal benchmark for LLM -based peer review automation, 2025. URL https://arxiv.org/abs/2508.14146
arXiv 2025
-
[25]
A multi-agent system for automating scientific discovery
Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, et al. A multi-agent system for automating scientific discovery. Nature, 2026. doi:10.1038/s41586-026-10652-y. URL https://www.nature.com/articles/s41586-026-10652-y
-
[26]
Gemini cli: An open-source ai agent
Google . Gemini cli: An open-source ai agent. https://github.com/google-gemini/gemini-cli, 2025
2025
-
[27]
Gemini 3.5 flash model card
Google DeepMind . Gemini 3.5 flash model card. https://deepmind.google/models/model-cards/gemini-3-5-flash/, 2026
2026
-
[28]
Accelerating scientific discovery with co-scientist
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Petar Sirkovic, Artiom Myaskovsky, Grzegorz Glowaty, Felix Weissenberger, Alessio Orlandi, Dan Popovici, et al. Accelerating scientific discovery with co-scientist. Nature, pages 1--3, 2026 a
2026
-
[29]
Accelerating scientific discovery with Co-Scientist
Juraj Gottweis et al. Accelerating scientific discovery with Co-Scientist . Nature, 2026 b . doi:10.1038/s41586-026-10644-y. URL https://www.nature.com/articles/s41586-026-10644-y
-
[30]
Artificial intelligence tools expand scientists’ impact but contract science’s focus
Qianyue Hao, Fengli Xu, Yong Li, and James Evans. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature, pages 1--7, 2026
2026
-
[31]
Closed-form continuous-time neural networks
Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, and Daniela Rus. Closed-form continuous-time neural networks. Nature Machine Intelligence, 4 0 (11): 0 992--1003, 2022
2022
-
[32]
REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025
Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, and Daniel Kang. REPRO -bench: Can agentic AI systems assess the reproducibility of social science research?, 2025. URL https://arxiv.org/abs/2507.18901
arXiv 2025
-
[33]
MLAgentBench : Evaluating language agents on machine learning experimentation, 2023
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench : Evaluating language agents on machine learning experimentation, 2023. URL https://arxiv.org/abs/2310.03302
arXiv 2023
-
[34]
Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, and Daniel Khashabi. Can coding agents reproduce findings in computational materials science?, 2026. URL h...
Pith/arXiv arXiv 2026
-
[35]
Olympiad-level formal mathematical reasoning with reinforcement learning
Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl \'o s Z Horv \'a th, Goran Z u z i \'c , Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pages 1--3, 2025
2025
-
[36]
Equivariant 3d-conditional diffusion model for molecular linker design
Ilia Igashov, Hannes St \"a rk, Cl \'e ment Vignac, Arne Schneuing, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, and Bruno Correia. Equivariant 3d-conditional diffusion model for molecular linker design. Nature Machine Intelligence, 6 0 (4): 0 417--427, 2024
2024
-
[37]
ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering
Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE -bench: A benchmark for long-horizon objective-driven algorithm engineering. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=JCjGvbsOmQ
2025
-
[38]
Peter Jansen, Marc-Alexandre Cote, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DISCOVERYWORLD : A virtual environment for developing and evaluating automated scientific discovery agents, 2024. URL https://arxiv.org/abs/2406.06769
arXiv 2024
-
[39]
DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026
Jiachen Jiang, Tianyu Ding, and Zhihui Zhu. DeltaEvolve : Accelerating scientific discovery through momentum-driven evolution, 2026. URL https://arxiv.org/abs/2602.02919
arXiv 2026
-
[40]
Highly accurate protein structure prediction with alphafold
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Z \' dek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. nature, 596 0 (7873): 0 583--589, 2021
2021
-
[41]
autoresearch
Andrej Karpathy. autoresearch . https://github.com/karpathy/autoresearch, 2026
2026
-
[42]
From reproduction to replication: Evaluating research agents with progressive code masking, 2025
Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, and Daniel Fried. From reproduction to replication: Evaluating research agents with progressive code masking, 2025. URL https://arxiv.org/abs/2506.19724
arXiv 2025
-
[43]
Cell2location maps fine-grained cell types in spatial transcriptomics
Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W King, Tong Li, Rasa Elmentaite, Artem Lomakin, Veronika Kedlian, Adam Gayoso, et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nature biotechnology, 40 0 (5): 0 661--671, 2022
2022
-
[44]
Jakub Lala, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. PaperQA : Retrieval-augmented generative agent for scientific research, 2023. URL https://arxiv.org/abs/2312.07559
arXiv 2023
-
[45]
Learning skillful medium-range global weather forecasting
Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382 0 (6677): 0 1416--1421, 2023
2023
-
[46]
Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB -bench: Measuring capabilities of language models for biology research, 2024. URL https://arxiv.org/abs/2407.10362
Pith/arXiv arXiv 2024
-
[47]
AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026
Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, and Tie-Yan Liu. AutoSOTA : An end-to-end automated research system for state-of-the-art AI model discovery, 2026. URL https://arxiv.org/abs/2604.05550
Pith/arXiv arXiv 2026
-
[48]
Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, and James Zou. Can large language models provide useful feedback on research papers? a large-scale empirical analysis, 2023. URL https://arxiv.org/abs/2310.01783
arXiv 2023
-
[49]
Position: Agentic evolution is the path to evolving LLM s, 2026
Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, and Jian Pei. Position: Agentic evolution is the path to evolving LLM s, 2026. URL https://arxiv.org/abs/2602.00359
arXiv 2026
-
[50]
Evolutionary-scale prediction of atomic-level protein structure with a language model
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379 0 (6637): 0 1123--1130, 2023
2023
-
[51]
Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025
Gang Liu, Yihan Zhu, et al. Scientific algorithm discovery by augmenting AlphaEvolve with deep research, 2025. URL https://arxiv.org/abs/2510.06056
arXiv 2025
-
[52]
Ryan Liu and Nihar B. Shah. ReviewerGPT ? an exploratory study on using large language models for paper reviewing, 2023. URL https://arxiv.org/abs/2306.00622
arXiv 2023
-
[53]
Evox: Meta-evolution for automated discovery
Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z Pan, et al. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026
arXiv 2026
-
[54]
The AI scientist: Towards fully automated open-ended scientific discovery, 2024
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292
Pith/arXiv arXiv 2024
-
[55]
Towards end-to-end automation of ai research
Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of ai research. Nature, 651 0 (8107): 0 914--919, 2026
2026
-
[56]
AIRS -bench: a suite of tasks for frontier AI research science agents, 2026
Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Cha...
arXiv 2026
-
[57]
Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi Jin
Bohan Lyu, Yucheng Yang, Siqiao Huang, Jiaru Zhang, Qixin Xu, Xinghan Li, Xinyang Han, Yicheng Zhang, Huaqing Zhang, Runhan Huang, Kaicheng Yang, Zitao Chen, Wentao Guo, Junlin Yang, Xinyue Ai, Wenhao Chai, Yadi Cao, Ziran Yang, Kun Wang, Dapeng Jiang, Huan-ang Gao, Shange Tang, Chengshuai Shi, Simon S. Du, Max Simchowitz, Jiantao Jiao, Dawn Song, and Chi...
Pith/arXiv arXiv 2026
-
[58]
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. EvoScientist : Towards multi-agent evolving AI scientists for end-to-end scientific discovery, 2026 b . URL https://arxiv.org/abs/2603.08127
arXiv 2026
-
[59]
Gonzalez, Jingbo Shang, and Alvin Cheung
Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, ...
arXiv 2025
-
[60]
Scaling deep learning for materials discovery
Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624 0 (7990): 0 80--85, 2023
2023
-
[61]
Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning
Jishuai Miao, Jinzhao Li, Jingxue Xin, Jiajuan Tu, Muyang Ge, Ji Qi, Xiaocheng Zhou, Ying Zhu, Can Yang, and Zhixiang Lin. Multigate: integrative analysis and regulatory inference in spatial multi-omics data via graph representation learning. Nature Communications, 16 0 (1): 0 9403, 2025
2025
-
[62]
Minimax m2.7: Early echoes of self-evolution
MiniMax . Minimax m2.7: Early echoes of self-evolution. https://www.minimax.io/news/minimax-m27-en, 2026
2026
-
[63]
Kimi k2.6
Moonshot AI . Kimi k2.6. https://www.kimi.com/ai-models/kimi-k2-6, 2026
2026
-
[64]
MLGym : A new framework and benchmark for advancing AI research agents, 2025
Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym : A new framework and benchmark for advancing AI rese...
arXiv 2025
-
[65]
Alexander Novikov, Ngan Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve : A coding agent for scientific and algor...
Pith/arXiv arXiv 2025
-
[66]
Codex cli: Lightweight coding agent that runs in your terminal
OpenAI . Codex cli: Lightweight coding agent that runs in your terminal. https://github.com/openai/codex, 2025
2025
-
[67]
Gpt-5.4 thinking system card
OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, 2026a
2026
-
[68]
Gpt-5.5 system card
OpenAI. Gpt-5.5 system card. https://openai.com/index/gpt-5-5-system-card/, 2026b
2026
-
[69]
OpenAI. What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them, 2026c
arXiv 2026
-
[70]
Weak signal extraction enabled by deep neural network denoising of diffraction data
Jens Oppliger, M Michael Denner, Julia Küspert, Ruggero Frison, Qisi Wang, Alexander Morawietz, Oleh Ivashko, Ann-Christin Dippel, Martin von Zimmermann, Izabela Biało, et al. Weak signal extraction enabled by deep neural network denoising of diffraction data. Nature Machine Intelligence, 6(2):180--186, 2024
2024
-
[71]
Human--ai adaptive dynamics drives the emergence of information cocoons
Jinghua Piao, Jiazhen Liu, Fang Zhang, Jun Su, and Yong Li. Human--ai adaptive dynamics drives the emergence of information cocoons. Nature Machine Intelligence, 5(11):1214--1224, 2023
2023
-
[72]
Enhanced spatial clustering of single-molecule localizations with graph neural networks
Jesús Pineda, Sergi Masó-Orriols, Montse Masoliver, Joan Bertran, Mattias Goksör, Giovanni Volpe, and Carlo Manzo. Enhanced spatial clustering of single-molecule localizations with graph neural networks. Nature Communications, 16(1):9693, 2025
2025
-
[73]
Probabilistic weather forecasting with machine learning
Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, et al. Probabilistic weather forecasting with machine learning. Nature, 637(8044):84--90, 2025
2025
-
[74]
Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering
Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar VK, Rongzhi Zhang, Changhao Li, Ian Wong, Sherry Yang, Percy Liang, Chao Zhang, et al. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering. Advances in Neural Information Processing Systems, 38, 2026
2026
-
[75]
Qwen3.7: The agent frontier
Qwen Team. Qwen3.7: The agent frontier. https://qwen.ai/blog?id=qwen3.7, 2026
2026
-
[76]
PostTrainBench: Can LLM agents automate LLM post-training?, 2026
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. PostTrainBench: Can LLM agents automate LLM post-training?, 2026. URL https://arxiv.org/abs/2603.08640
arXiv 2026
-
[77]
Mathematical discoveries from program search with large language models
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625:468--475, 2024. doi:10.1038/s41586-023-06924-6...
-
[78]
CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark, 2024
Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, and Arvind Narayanan. CORE-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark, 2024. URL https://arxiv.org/abs/2409.11363
Pith/arXiv arXiv 2024
-
[79]
Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740
arXiv 2024
-
[80]
PaperBench: Evaluating AI's ability to replicate AI research, 2025
Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI's ability to replicate AI research, 2025. URL https://arxiv.org/abs/2504.01848
Pith/arXiv arXiv 2025