pith. machine review for the scientific record.

arxiv: 2605.08678 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords MLS-Bench · AI agents · ML method invention · generalization · scalability · benchmark · test-time scaling · method discovery

The pith

AI agents cannot reliably invent ML methods that beat human designs on generalization and scaling tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MLS-Bench to measure whether AI systems can discover machine learning methods that improve performance, hold up across varied settings, and scale to larger regimes. It sets up 140 tasks in 12 domains, each requiring an agent to modify one part of an ML system or algorithm and then prove the change works more broadly and at larger scale. The evaluations show that agents fall short of consistently beating human-designed methods, and that simple tuning proves easier for them than original invention. A reader would care because reliable agent-driven method discovery could accelerate progress in building stronger AI without ongoing human redesign. The work identifies the core limit as insufficient scientific insight for planning, validating, and scaling ideas, a gap that extra search or compute does not close.

Core claim

MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve one targeted component of an ML system or algorithm and to demonstrate that the improvement generalizes across controlled settings and scales. Current agents remain far from reliably surpassing human-designed methods. Engineering-style tuning is easier for them than genuine method invention. The bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck.
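
The paper's task schema is not reproduced on this page; purely as a minimal sketch, assuming each task bundles a target component, a human baseline score, visible and held-out settings, and the scale regimes to test, a specification might look like the following (all field names are hypothetical):

```python
# Hypothetical sketch of an MLS-Bench-style task record. Field names and
# structure are assumptions for illustration, not the benchmark's actual
# schema; the released data at https://mls-bench.com is authoritative.
from dataclasses import dataclass, field

@dataclass
class MethodDiscoveryTask:
    task_id: str                  # e.g. "optimizers/beat-baseline-variant" (invented)
    domain: str                   # one of the 12 research areas
    target_component: str         # the single component the agent may modify
    baseline_score: float         # score of the human-designed method
    dev_settings: list[str]       # configurations the agent sees during search
    heldout_settings: list[str]   # controlled settings used for generalization checks
    scale_regimes: list[int] = field(default_factory=list)  # larger regimes to test

    def succeeds(self, scores: dict[str, float]) -> bool:
        # A submission counts only if it beats the baseline on every held-out
        # setting, not merely on the configurations visible during search.
        return all(scores[s] > self.baseline_score for s in self.heldout_settings)
```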

What carries the argument

MLS-Bench, the benchmark of 140 tasks across 12 domains that each require an agent to propose and validate an improvement to an ML component with explicit tests for generalization and scalability.

If this is right

  • Agents perform better on engineering adjustments than on creating original methods.
  • Providing more test-time compute, adaptive allocation, or extra context does not overcome the insight limitation; a minimal probe of this claim is sketched after this list.
  • Planning, validating, and scaling claims remain harder than idea generation for current systems.
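
The probe promised above is simple to state: log each attempt's score against cumulative compute and watch whether the running best keeps climbing. A minimal sketch, where `propose_and_evaluate` is a hypothetical stand-in for one agent rollout scored by the task harness:

```python
# Minimal test-time-scaling probe in the spirit of Figure 6 (left): a
# flattening running-best curve indicates that more search alone is not
# closing the gap. `propose_and_evaluate` is a hypothetical stand-in for
# one agent attempt plus the benchmark's scoring step.
def running_best_curve(propose_and_evaluate, budget_hours: float, step_hours: float):
    spent, best, curve = 0.0, float("-inf"), []
    while spent < budget_hours:
        score = propose_and_evaluate()   # one proposal, one controlled evaluation
        spent += step_hours
        best = max(best, score)
        curve.append((spent, best))
    return curve
```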

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future agent designs may need built-in mechanisms for scientific validation to advance beyond tuning.
  • The benchmark setup could be extended to new domains to check whether the insight gap persists across fields.
  • Human oversight might stay necessary for the insight step while agents handle execution and testing.

Load-bearing premise

The 140 tasks and 12 domains capture the essential skills for inventing generalizable and scalable ML methods without missing major aspects of actual research.

What would settle it

An agent that proposes and validates improvements outperforming human baselines on a majority of the tasks while passing controlled generalization and scaling checks.
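
Stated as a check, and only as a sketch: count a task as won when the agent's submission beats the human baseline and clears both gates, then require wins on a majority of the 140 tasks. `TaskResult` is a hypothetical record, not an artifact of the benchmark:

```python
# Hedged sketch of the settling criterion above. `TaskResult` is hypothetical.
from dataclasses import dataclass

@dataclass
class TaskResult:
    beats_baseline: bool   # outperforms the human-designed method
    generalizes: bool      # passes the controlled held-out-setting checks
    scales: bool           # passes the larger-regime checks

def settles_the_question(results: list[TaskResult]) -> bool:
    wins = sum(r.beats_baseline and r.generalizes and r.scales for r in results)
    return wins > len(results) / 2   # a majority of the benchmark's tasks
```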

Figures

Figures reproduced from arXiv: 2605.08678 by Bohan Lyu, Chengshuai Shi, Chi Jin, Dapeng Jiang, Dawn Song, Huan-ang Gao, Huaqing Zhang, Jiantao Jiao, Jiaru Zhang, Junlin Yang, Kaicheng Yang, Kun Wang, Max Simchowitz, Qixin Xu, Runhan Huang, Shange Tang, Simon S. Du, Siqiao Huang, Wenhao Chai, Wentao Guo, Xinghan Li, Xinyang Han, Xinyue Ai, Yadi Cao, Yicheng Zhang, Yucheng Yang, Ziran Yang, Zitao Chen.

Figure 1
Figure 1: MLS-Bench overview. Left: comparison of Frontier-CS, MLE-Bench, and MLS-Bench.
Figure 2
Figure 2: MLS-Bench-Lite performance across 15 models.
Figure 3
Figure 3: Compute profile: GPU vs. CPU task ratio and the distribution of GPU-hours per experiment.
Figure 4
Figure 4: MLS-Bench's design: task specification, validity enforcement, and unified scoring.
Figure 5
Figure 5: Analysis on the evaluation protocol. Left: scientific-innovation prompt vs. engineering-optimization prompt. Middle: the budget check prohibits the models from hacking model size for higher performance. Right: in-distribution vs. OOD settings, from first proposal to final submission.
Figure 6
Figure 6: Test-time scaling. Left: running-best score vs. cumulative compute for the three inference-time setups. Right: TTT-Discover trained on two tasks, both hacking the visible settings.
Figure 7
Figure 7: Adaptive compute-allocation experiment. Left: cumulative compute budget consumed along the exploration trajectory for each of the five agents. Right: final-submission score.
Figure 8
Figure 8: Context engineering. We test whether additional context can benefit the agents, adding three settings: (1) Web search, where we equip the agents with a strong search tool based on Tavily; (2) Baseline ctx., which provides detailed derivations, key steps, and reasoning from the baseline papers; and (3) Theory ctx., which provides background from relevant textbooks or theory-oriented literature.
Figure 9
Figure 9: Similarity-weighted baseline performance vs. agent performance. Expert assessment on agent submissions reveals dominant patterns: 1. agents recombine ingredients drawn from the baselines they are shown and present the recombination as new; and 2. truly novel components are rare and, when they appear, usually lack a stated reason to help. Per-model style differs: GPT-5.4 reaches for the most structurally …
read the original abstract

Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces MLS-Bench, a benchmark with 140 tasks across 12 domains for evaluating whether AI agents can invent generalizable and scalable ML methods. Each task requires an agent to improve a targeted component of an ML system or algorithm while demonstrating that the improvement generalizes across controlled settings and scales. The authors report that current agents remain far from reliably surpassing human-designed methods, that engineering-style tuning is easier than genuine method invention, and that the bottleneck lies in scientific insight for planning, validation, and scaling. Analyses examine test-time scaling, adaptive compute allocation, and context provision, concluding that more search, compute, or context alone does not remove the bottleneck. Data, code, and a community platform are released.

Significance. If the tasks are constructed such that they require planning, validation, and scaling insight beyond hyperparameter search or prompt engineering, MLS-Bench would offer a valuable, reproducible resource for measuring progress on AI-driven method discovery. The explicit release of data and code, along with the community platform, is a clear strength that enables cumulative iteration and supports the empirical claims about scaling effects.

major comments (1)
  1. Abstract and task construction: the central findings, that 'more search, compute, or context alone does not remove this bottleneck' and that tuning is easier than invention, are load-bearing on the 140 tasks genuinely requiring scientific insight rather than admitting solutions via hyperparameter tuning, local modifications, or prompt engineering that add no new theoretical justification or controlled scaling experiments. The manuscript should provide concrete examples or analyses demonstrating how the task definitions penalize pure tuning approaches.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of MLS-Bench as a reproducible resource. We address the single major comment below and will revise the manuscript to strengthen the exposition of task construction.

read point-by-point responses
  1. Referee: Abstract and task construction: the central findings, that 'more search, compute, or context alone does not remove this bottleneck' and that tuning is easier than invention, are load-bearing on the 140 tasks genuinely requiring scientific insight rather than admitting solutions via hyperparameter tuning, local modifications, or prompt engineering that add no new theoretical justification or controlled scaling experiments. The manuscript should provide concrete examples or analyses demonstrating how the task definitions penalize pure tuning approaches.

    Authors: We agree that the central claims rest on the tasks demanding more than hyperparameter tuning or prompt engineering. Each task in MLS-Bench requires an agent to improve a targeted component while also designing and reporting controlled experiments that establish generalization across held-out settings and scaling behavior to larger regimes; these requirements are stated in the task templates and evaluation rubrics. Pure tuning approaches typically succeed on a single training configuration but fail the generalization and scaling criteria, as shown in our agent failure analyses. That said, we acknowledge the manuscript would be clearer with explicit illustrations. In the revision we will add a dedicated subsection (likely in Section 3 or 4) containing 3–4 concrete task examples, the precise success criteria, and side-by-side results showing where tuning-only baselines plateau while insight-driven solutions continue to improve. We will also include a short quantitative comparison of agent success rates on tuning versus invention-oriented subtasks. revision: yes
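
The rebuttal's claim that pure tuning fails the generalization criteria, together with the budget check mentioned in Figure 5, suggests a concrete validity gate. The following is an illustration under assumed thresholds and field names, not the benchmark's actual rules:

```python
# Illustrative validity gate, not MLS-Bench's actual rules: the threshold and
# field names are assumptions. It rejects model-size hacking (Figure 5,
# middle) and filters tuning that only helps the visible configurations.
def passes_validity_gate(submission, task, param_budget_ratio: float = 1.05) -> bool:
    # Reject submissions that "win" by inflating parameter count past budget.
    if submission.param_count > param_budget_ratio * task.baseline_param_count:
        return False
    # Reject submissions that beat the baseline only on visible settings:
    # the improvement must hold on every held-out controlled setting.
    return all(
        submission.scores[s] > task.baseline_score for s in task.heldout_settings
    )
```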

Circularity Check

0 steps flagged

No significant circularity in the MLS-Bench benchmark

full rationale

The paper introduces MLS-Bench as an empirical benchmark of 140 tasks across 12 domains for assessing AI agents on inventing generalizable ML methods. It reports experimental findings that current agents fall short of human-designed methods and that tuning is easier than invention. There are no mathematical derivations, equations, fitted parameters, or predictions that could reduce to their inputs by construction. The work releases data and code externally and makes no load-bearing self-citations for any theoretical claim. All central assertions rest on direct task evaluations rather than self-referential definitions or renamed known results. This is a standard benchmark release with no internal derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen tasks measure invention capability; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: the selected 140 tasks across 12 domains are representative of generalizable ML method invention.
    Invoked in the benchmark construction and evaluation design.

pith-pipeline@v0.9.0 · 5639 in / 1122 out tokens · 31975 ms · 2026-05-12T01:09:27.664741+00:00 · methodology

