pith. machine review for the scientific record.

arxiv: 2605.08678 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords MLS-Bench · AI agents · ML method invention · generalization · scalability · benchmark · test-time scaling · method discovery

The pith

AI agents cannot reliably invent ML methods that beat human designs on generalization and scaling tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MLS-Bench to measure whether AI systems can discover machine learning methods that improve performance, hold up across varied settings, and scale to larger regimes. It sets up 140 tasks in 12 domains, each requiring an agent to modify one part of an ML system or algorithm and then prove the change works more broadly and at larger scale. The evaluations show that agents fall short of consistently beating human-designed methods, and that simple tuning proves easier for them than original invention. A reader would care because reliable agent-driven method discovery could accelerate progress in building stronger AI without ongoing human redesign. The work identifies the core limit as insufficient scientific insight for planning, validating, and scaling ideas, a gap that extra search or compute does not close.

Core claim

MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve one targeted component of an ML system or algorithm and to demonstrate that the improvement generalizes across controlled settings and scales. Current agents remain far from reliably surpassing human-designed methods. Engineering-style tuning is easier for them than genuine method invention. The bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck.
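
The paper's task schema is not reproduced on this page; purely as a minimal sketch, assuming each task bundles a target component, a human baseline score, visible and held-out settings, and the scale regimes to test, a specification might look like the following (all field names are hypothetical):

```python
# Hypothetical sketch of an MLS-Bench-style task record. Field names and
# structure are assumptions for illustration, not the benchmark's actual
# schema; the released data at https://mls-bench.com is authoritative.
from dataclasses import dataclass, field

@dataclass
class MethodDiscoveryTask:
    task_id: str                  # e.g. "optimizers/beat-baseline-variant" (invented)
    domain: str                   # one of the 12 research areas
    target_component: str         # the single component the agent may modify
    baseline_score: float         # score of the human-designed method
    dev_settings: list[str]       # configurations the agent sees during search
    heldout_settings: list[str]   # controlled settings used for generalization checks
    scale_regimes: list[int] = field(default_factory=list)  # larger regimes to test

    def succeeds(self, scores: dict[str, float]) -> bool:
        # A submission counts only if it beats the baseline on every held-out
        # setting, not merely on the configurations visible during search.
        return all(scores[s] > self.baseline_score for s in self.heldout_settings)
```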

What carries the argument

MLS-Bench, the benchmark of 140 tasks across 12 domains that each require an agent to propose and validate an improvement to an ML component with explicit tests for generalization and scalability.

If this is right

  • Agents perform better on engineering adjustments than on creating original methods.
  • Providing more test-time compute, adaptive allocation, or extra context does not overcome the insight limitation; a minimal probe of this claim is sketched after this list.
  • Planning, validating, and scaling claims remain harder than idea generation for current systems.
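
The probe promised above is simple to state: log each attempt's score against cumulative compute and watch whether the running best keeps climbing. A minimal sketch, where `propose_and_evaluate` is a hypothetical stand-in for one agent rollout scored by the task harness:

```python
# Minimal test-time-scaling probe in the spirit of Figure 6 (left): a
# flattening running-best curve indicates that more search alone is not
# closing the gap. `propose_and_evaluate` is a hypothetical stand-in for
# one agent attempt plus the benchmark's scoring step.
def running_best_curve(propose_and_evaluate, budget_hours: float, step_hours: float):
    spent, best, curve = 0.0, float("-inf"), []
    while spent < budget_hours:
        score = propose_and_evaluate()   # one proposal, one controlled evaluation
        spent += step_hours
        best = max(best, score)
        curve.append((spent, best))
    return curve
```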

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future agent designs may need built-in mechanisms for scientific validation to advance beyond tuning.
  • The benchmark setup could be extended to new domains to check whether the insight gap persists across fields.
  • Human oversight might stay necessary for the insight step while agents handle execution and testing.

Load-bearing premise

The 140 tasks and 12 domains capture the essential skills for inventing generalizable and scalable ML methods without missing major aspects of actual research.

What would settle it

An agent that proposes and validates improvements outperforming human baselines on a majority of the tasks while passing controlled generalization and scaling checks.
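
Stated as a check, and only as a sketch: count a task as won when the agent's submission beats the human baseline and clears both gates, then require wins on a majority of the 140 tasks. `TaskResult` is a hypothetical record, not an artifact of the benchmark:

```python
# Hedged sketch of the settling criterion above. `TaskResult` is hypothetical.
from dataclasses import dataclass

@dataclass
class TaskResult:
    beats_baseline: bool   # outperforms the human-designed method
    generalizes: bool      # passes the controlled held-out-setting checks
    scales: bool           # passes the larger-regime checks

def settles_the_question(results: list[TaskResult]) -> bool:
    wins = sum(r.beats_baseline and r.generalizes and r.scales for r in results)
    return wins > len(results) / 2   # a majority of the benchmark's tasks
```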

Figures

Figures reproduced from arXiv: 2605.08678 by Bohan Lyu, Chengshuai Shi, Chi Jin, Dapeng Jiang, Dawn Song, Huan-ang Gao, Huaqing Zhang, Jiantao Jiao, Jiaru Zhang, Junlin Yang, Kaicheng Yang, Kun Wang, Max Simchowitz, Qixin Xu, Runhan Huang, Shange Tang, Simon S. Du, Siqiao Huang, Wenhao Chai, Wentao Guo, Xinghan Li, Xinyang Han, Xinyue Ai, Yadi Cao, Yicheng Zhang, Yucheng Yang, Ziran Yang, Zitao Chen.

Figure 1
Figure 1: MLS-Bench overview. Left: comparison of Frontier-CS, MLE-Bench, and MLS-Bench.
Figure 2
Figure 2: MLS-Bench-Lite performance across 15 models.
Figure 3
Figure 3: Compute profile: GPU vs. CPU task ratio and the distribution of GPU-hours per experiment.
Figure 4
Figure 4: MLS-Bench's design: task specification, validity enforcement, and unified scoring.
Figure 5
Figure 5: Analysis on the evaluation protocol. Left: scientific-innovation prompt vs. engineering-optimization prompt. Middle: the budget check prohibits the models from hacking model size for higher performance. Right: in-distribution vs. OOD settings, from first proposal to final submission.
Figure 6
Figure 6: Test-time scaling. Left: running-best score vs. cumulative compute for the three inference-time setups. Right: TTT-Discover trained on two tasks, both hacking the visible settings.
Figure 7
Figure 7: Adaptive compute-allocation experiment. Left: cumulative compute budget consumed along the exploration trajectory for each of the five agents. Right: final-submission score.
Figure 8
Figure 8: Context engineering. We test whether additional context can benefit the agents, adding three settings: (1) Web search, where we equip the agents with a strong search tool based on Tavily; (2) Baseline ctx., which provides detailed derivations, key steps, and reasoning from the baseline papers; and (3) Theory ctx., which provides background from relevant textbooks or theory-oriented literature.
Figure 9
Figure 9: Similarity-weighted baseline performance vs. agent performance. Expert assessment on agent submissions reveals dominant patterns: 1. agents recombine ingredients drawn from the baselines they are shown and present the recombination as new; and 2. truly novel components are rare and, when they appear, usually lack a stated reason to help. Per-model style differs: GPT-5.4 reaches for the most structurally …
read the original abstract

Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces MLS-Bench, a benchmark with 140 tasks across 12 domains for evaluating whether AI agents can invent generalizable and scalable ML methods. Each task requires an agent to improve a targeted component of an ML system or algorithm while demonstrating that the improvement generalizes across controlled settings and scales. The authors report that current agents remain far from reliably surpassing human-designed methods, that engineering-style tuning is easier than genuine method invention, and that the bottleneck lies in scientific insight for planning, validation, and scaling. Analyses examine test-time scaling, adaptive compute allocation, and context provision, concluding that more search, compute, or context alone does not remove the bottleneck. Data, code, and a community platform are released.

Significance. If the tasks are constructed such that they require planning, validation, and scaling insight beyond hyperparameter search or prompt engineering, MLS-Bench would offer a valuable, reproducible resource for measuring progress on AI-driven method discovery. The explicit release of data and code, along with the community platform, is a clear strength that enables cumulative iteration and supports the empirical claims about scaling effects.

major comments (1)
  1. Abstract and task construction: the central findings, that 'more search, compute, or context alone does not remove this bottleneck' and that tuning is easier than invention, are load-bearing on the 140 tasks genuinely requiring scientific insight rather than admitting solutions via hyperparameter tuning, local modifications, or prompt engineering that add no new theoretical justification or controlled scaling experiments. The manuscript should provide concrete examples or analyses demonstrating how the task definitions penalize pure tuning approaches.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of MLS-Bench as a reproducible resource. We address the single major comment below and will revise the manuscript to strengthen the exposition of task construction.

read point-by-point responses
  1. Referee: Abstract and task construction: the central findings, that 'more search, compute, or context alone does not remove this bottleneck' and that tuning is easier than invention, are load-bearing on the 140 tasks genuinely requiring scientific insight rather than admitting solutions via hyperparameter tuning, local modifications, or prompt engineering that add no new theoretical justification or controlled scaling experiments. The manuscript should provide concrete examples or analyses demonstrating how the task definitions penalize pure tuning approaches.

    Authors: We agree that the central claims rest on the tasks demanding more than hyperparameter tuning or prompt engineering. Each task in MLS-Bench requires an agent to improve a targeted component while also designing and reporting controlled experiments that establish generalization across held-out settings and scaling behavior to larger regimes; these requirements are stated in the task templates and evaluation rubrics. Pure tuning approaches typically succeed on a single training configuration but fail the generalization and scaling criteria, as shown in our agent failure analyses. That said, we acknowledge the manuscript would be clearer with explicit illustrations. In the revision we will add a dedicated subsection (likely in Section 3 or 4) containing 3–4 concrete task examples, the precise success criteria, and side-by-side results showing where tuning-only baselines plateau while insight-driven solutions continue to improve. We will also include a short quantitative comparison of agent success rates on tuning versus invention-oriented subtasks. revision: yes
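
The rebuttal's claim that pure tuning fails the generalization criteria, together with the budget check mentioned in Figure 5, suggests a concrete validity gate. The following is an illustration under assumed thresholds and field names, not the benchmark's actual rules:

```python
# Illustrative validity gate, not MLS-Bench's actual rules: the threshold and
# field names are assumptions. It rejects model-size hacking (Figure 5,
# middle) and filters tuning that only helps the visible configurations.
def passes_validity_gate(submission, task, param_budget_ratio: float = 1.05) -> bool:
    # Reject submissions that "win" by inflating parameter count past budget.
    if submission.param_count > param_budget_ratio * task.baseline_param_count:
        return False
    # Reject submissions that beat the baseline only on visible settings:
    # the improvement must hold on every held-out controlled setting.
    return all(
        submission.scores[s] > task.baseline_score for s in task.heldout_settings
    )
```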

Circularity Check

0 steps flagged

No significant circularity in the MLS-Bench benchmark

full rationale

The paper introduces MLS-Bench as an empirical benchmark of 140 tasks across 12 domains for assessing AI agents on inventing generalizable ML methods. It reports experimental findings that current agents fall short of human-designed methods and that tuning is easier than invention. There are no mathematical derivations, equations, fitted parameters, or predictions that could reduce to their inputs by construction. The work releases data and code externally and makes no load-bearing self-citations for any theoretical claim. All central assertions rest on direct task evaluations rather than self-referential definitions or renamed known results. This is a standard benchmark release with no internal derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen tasks measure invention capability; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: the selected 140 tasks across 12 domains are representative of generalizable ML method invention.
    Invoked in the benchmark construction and evaluation design.

pith-pipeline@v0.9.0 · 5639 in / 1122 out tokens · 31975 ms · 2026-05-12T01:09:27.664741+00:00 · methodology

