
arxiv: 2605.17373 · v1 · pith:TUCWOVSSnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords AI research agents · search strategies · greedy hill-climbing · tree search · adaptive exploration · process metrics · machine learning benchmarks · opportunity structure

The pith

A simple greedy hill-climber nearly matches top tree-search performance in AI research agents, while an adaptive strategy that broadens exploration on stagnation outperforms all six tested approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FML-Bench to isolate the effects of search strategy from execution details when comparing AI agents that automate machine learning research. It runs six representative agents on 18 fundamental tasks spanning 10 domains and tracks 12 process-level metrics such as convergence speed and exploration focus. Results show that added strategy complexity does not reliably improve outcomes: a basic greedy approach performs almost as well as the strongest tree-search method, and both clearly beat the other agents. An adaptive agent that detects stagnation and switches to wider search beats every fixed strategy. Process metrics further link early focused progress to better final results, while solution variety and raw compute spend show no such link.

Core claim

Evaluating six agents on the 18-task benchmark reveals that strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents. Analysis ties this pattern to improvement opportunity structure, with greedy search favored when opportunities are dense and tree or evolutionary search favored when they are sparse. An adaptive agent that switches to broader exploration upon detecting improvement stagnation outperforms the other six agents, and process-level metrics show that early convergence and directionally focused exploration are significantly associated with final performance.
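The stagnation-triggered switch described above can be illustrated with a minimal sketch. It is not the paper's agent: the one-dimensional search space, the `patience` window, and the `narrow` and `wide` step sizes are all hypothetical stand-ins chosen only to show the control flow of "stay greedy while improving, broaden when improvement stalls".

```python
import random
from typing import Callable, List, Tuple


def adaptive_search(
    score: Callable[[float], float],
    start: float,
    steps: int = 100,
    patience: int = 5,    # hypothetical: non-improving steps tolerated before broadening
    narrow: float = 0.1,  # hypothetical: proposal radius in greedy mode
    wide: float = 2.0,    # hypothetical: proposal radius in broad mode
    seed: int = 0,
) -> Tuple[float, float, List[str]]:
    """Hill-climb around the incumbent; widen proposals once improvement stagnates."""
    rng = random.Random(seed)
    best_x, best_y = start, score(start)
    since_improve = 0
    modes: List[str] = []
    for _ in range(steps):
        stagnating = since_improve >= patience
        modes.append("broad" if stagnating else "greedy")
        radius = wide if stagnating else narrow
        candidate = best_x + rng.uniform(-radius, radius)
        value = score(candidate)
        if value > best_y:
            # An improvement resets the counter, so the search returns to greedy mode.
            best_x, best_y = candidate, value
            since_improve = 0
        else:
            since_improve += 1
    return best_x, best_y, modes
```

Because the incumbent is only replaced on strict improvement, the best score never decreases; the only thing the stagnation rule changes is how far from the incumbent the next proposal is drawn.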

What carries the argument

FML-Bench, a controlled benchmark of 18 ML research tasks with 12 process-level behavioral metrics that separates agent search strategy from execution infrastructure.
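One way to picture the separation of strategy from execution infrastructure is a loop in which the agent supplies only the "what to try next" decision and everything else is shared. The sketch below is an assumption-laden illustration, not the FML-Bench API: `Trial`, `Strategy`, `run_loop`, `GreedyStrategy`, and the `apply_and_run` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Trial:
    idea: str         # what the strategy proposed
    val_score: float  # validation score returned by the shared infrastructure


class Strategy(Protocol):
    """The only agent-specific piece: decide what to try next."""

    def propose(self, history: List[Trial]) -> str: ...


def run_loop(strategy: Strategy, apply_and_run: Callable[[str], float], steps: int) -> Trial:
    """Shared loop: the strategy proposes, the framework modifies and executes.

    `apply_and_run` stands in for the shared code-editing and execution steps.
    Assumes steps >= 1; returns the best-validated trial.
    """
    history: List[Trial] = []
    for _ in range(steps):
        idea = strategy.propose(history)
        history.append(Trial(idea, apply_and_run(idea)))
    return max(history, key=lambda t: t.val_score)


class GreedyStrategy:
    """Toy hill-climber: always extend the best idea seen so far."""

    def propose(self, history: List[Trial]) -> str:
        if not history:
            return "baseline"
        best = max(history, key=lambda t: t.val_score)
        return best.idea + "+"
```

Swapping `GreedyStrategy` for a tree-search or evolutionary proposer would leave `run_loop` and `apply_and_run` untouched, which is the property that lets performance differences be attributed to strategy.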

If this is right

  • Greedy search tends to be more effective on tasks where improvement opportunities are dense.
  • Tree-search and evolutionary strategies tend to be more effective on tasks where improvement opportunities are sparse.
  • Early convergence and directionally focused exploration predict higher final performance across agents.
  • Solution diversity and total compute cost show no reliable association with final performance.
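The association claims above amount to relating a per-run process metric to final performance. The page does not say which statistic the paper uses, so the sketch below assumes a Spearman rank correlation purely for illustration; the helper names are hypothetical and the implementation uses only the standard library.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric: Sequence[float], final_score: Sequence[float]) -> float:
    """Rank correlation between a process metric and final performance."""
    return pearson(ranks(metric), ranks(final_score))
```

A value near +1 or -1 would indicate that runs ranked high on the metric also tend to rank high (or low) on final performance; a value near 0 is what "no reliable association" would look like under this assumed statistic.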

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents in other scientific domains could adopt similar stagnation-triggered switches between local and global search.
  • Task designers might classify new problems by opportunity density to select or combine strategies in advance.
  • The benchmark could be extended with tasks that deliberately vary opportunity density to test the adaptive rule more directly.
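To make "opportunity density" concrete, here is a minimal sketch that assumes density is the fraction of steps in a run that beat the best validation score seen so far. This follows the referee report's description of the classification as based on agent-run improvement frequencies, but the exact definition, the function names, and the `threshold` value are assumptions, not the paper's.

```python
from typing import Sequence


def opportunity_density(val_scores: Sequence[float]) -> float:
    """Fraction of steps after the first that strictly improve on the incumbent."""
    if len(val_scores) < 2:
        raise ValueError("need at least two scores to estimate density")
    best = val_scores[0]
    improvements = 0
    for score in val_scores[1:]:
        if score > best:
            improvements += 1
            best = score
    return improvements / (len(val_scores) - 1)


def label_regime(density: float, threshold: float = 0.2) -> str:
    """Hypothetical cut-off separating dense from sparse tasks."""
    return "dense" if density >= threshold else "sparse"
```

An estimate of this kind still depends on running some agent first, so it would not by itself provide the a-priori measure the referee asks for.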

Load-bearing premise

The chosen 18 tasks and 12 metrics adequately represent the dense versus sparse improvement opportunity structures found in real machine learning research problems.

What would settle it

A follow-up study on a fresh collection of ML research tasks finds that the adaptive agent no longer leads the others or that greedy and tree-search performance gaps disappear.

Figures

Figures reproduced from arXiv: 2605.17373 by Anirudh Goyal, Chang Liu, Dianbo Liu, Hou Hei Lam, Qiran Zou, Samson Yu, Srinivas Anumasa, Tianyi Zhang, Tingting Chen, Wenhao Zhao, Yiming Tang, Yingtao Zhu, Zhengyao Jiang, Zufeng Zhang.

Figure 1: Comparison of the six AI research agents on FML-bench. Left: per-agent mean normalized test improvement (left axis) and average pairwise win-rate (right axis), agents ranked by mean improvement. Right: per-agent fingerprint over six process-level axes capturing convergence efficiency, exploration geometry, and cost frugality (higher is better on every axis).
Figure 2: The FML-bench evaluation pipeline. Left: the task specification fed to the agent. Center: the agent iterates a propose, modify, execute loop; only the decision of what to try next is governed by the agent’s own strategy (unlocked icon), while codebase modification and experiment execution (locked icons) are shared framework infrastructure. Right: the framework evaluates the best-validated codebase on a hel…
Figure 3: Mean convergence curves across 18 research tasks. Each line is per-agent mean best-so-far validation improvement, averaged over 18 tasks × 3 rounds, at each of the 100 optimization steps
Figure 4: Search-regime crossover. Per-agent mean normalized test improvement on the high and low opportunity-density partitions; error bars are the cross-round standard deviation. Autoresearch leads the high-density partition but falls to sixth (of seven) on low-density; AdaptiveSearch ranks in the top two on both partitions (second on high-density, first on low-density), confirming that adaptive switching is robus…
Figure 5: Autoresearch’s per-task improvement is the most polarized of the six agents. Left: per-agent improvement distribution across the 18 tasks (3-round mean per cell), agents sorted by std. Right: per-task rank distribution (rank 1 best, rank 6 worst). Autoresearch attains the largest improvement std and the most extreme rank distribution.
Figure 6: Raw quality comparison across backbone LLMs. Gemini 3.1 Pro is the most consistent …
Figure 7: Cost–quality trade-off across backbone LLMs. GPT-5.4 occupies the low-cost regime while …
Figure 8: Pooled modification-type distribution across three runs for each agent. All agents are …
Figure 9: Failure-type rate for each agent, measured as the percentage of all trials ending in each …
Original abstract

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces FML-Bench, a controlled benchmark of 18 fundamental ML research tasks across 10 domains. It separates agent search strategy from execution infrastructure and defines 12 process-level behavioral metrics. Experiments on six representative agents show that a simple greedy hill-climber nearly matches the best tree-search agent (both substantially above the others). The authors interpret this pattern as relating to dense versus sparse improvement opportunity structures across tasks; an adaptive agent that switches to broader exploration upon detecting stagnation outperforms the six baselines. Process-level analysis finds early convergence and directionally focused exploration significantly associated with final performance, while solution diversity and compute cost are not.

Significance. If the empirical results hold, the work supplies reproducible, process-aware evidence on strategy effectiveness for AI research agents, showing that added complexity does not automatically improve outcomes and that adaptive switching can be beneficial. The open benchmark and released code enable direct follow-up studies. The process metrics move the field beyond final-score comparisons toward mechanistic understanding of exploration behavior.

major comments (1)
  1. [Analysis of opportunity structure / adaptive agent construction] Analysis section on opportunity structure and adaptive agent: the classification of tasks into dense versus sparse improvement opportunities is performed post-hoc from the same agent-run improvement frequencies that generate the performance numbers. This creates a risk that the reported strategy-by-task interaction is tautological rather than explanatory. An independent, a-priori measure of opportunity density (defined from task properties before any agent execution) is needed to substantiate the interpretation that underpins the adaptive-agent design and the claim that it generalizes the observed pattern.
minor comments (3)
  1. [Task description] Table 1 or task-description section: confirm whether the 18 tasks are evenly distributed across the stated 10 domains or whether some domains contain multiple related tasks; this affects claims about breadth.
  2. [Metrics definition] Process-metrics definitions: provide explicit formulas or pseudocode for the 12 metrics (especially the stagnation-detection rule used by the adaptive agent) so that future work can replicate the exact thresholds.
  3. [Results] Results figures: add statistical significance markers or confidence intervals to the performance bar charts so readers can judge whether the greedy–tree-search parity and the adaptive-agent gains are reliable.
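The confidence intervals requested in the last minor comment could be obtained in several ways. As one hedged illustration, the sketch below computes a percentile bootstrap interval for a mean; the function name, the resample count, and the idea of feeding it per-task improvement differences between two agents are assumptions, not something the paper or the report specifies.

```python
import random
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If `values` held per-task differences in normalized improvement between two agents, an interval that excludes zero would suggest the gap is unlikely to be resampling noise, while one that straddles zero would be consistent with parity.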

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful feedback on our manuscript. We address the major comment below and describe the revisions we plan to make.

Point-by-point responses
  1. Referee: [Analysis of opportunity structure / adaptive agent construction] Analysis section on opportunity structure and adaptive agent: the classification of tasks into dense versus sparse improvement opportunities is performed post-hoc from the same agent-run improvement frequencies that generate the performance numbers. This creates a risk that the reported strategy-by-task interaction is tautological rather than explanatory. An independent, a-priori measure of opportunity density (defined from task properties before any agent execution) is needed to substantiate the interpretation that underpins the adaptive-agent design and the claim that it generalizes the observed pattern.

    Authors: We agree that the current categorization of tasks into dense versus sparse improvement opportunities relies on post-hoc analysis of improvement frequencies observed during agent runs, which introduces a potential circularity with the performance results. To address this limitation, we will revise the manuscript to define and compute an independent a-priori measure of opportunity density based solely on intrinsic task properties prior to any agent execution. This measure will incorporate factors such as the dimensionality of the hyperparameter space, the number of modifiable components in the task formulation, and indicators of landscape structure derivable from the task description itself. Tasks will then be reclassified using this pre-defined metric, and we will re-examine the strategy-by-task performance patterns to verify consistency with the dense/sparse distinction. The adaptive agent design and associated claims will be updated to reference this independent categorization, thereby strengthening the explanatory power and generalizability of the results. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with direct comparisons; no derivation reduces to fitted inputs or self-citation

full rationale

The paper introduces FML-Bench as an experimental platform with 18 tasks and 12 process metrics, then reports head-to-head performance of six agents plus one adaptive variant constructed from observed patterns. All central claims (greedy nearly matching tree search, adaptive outperforming others, associations with early convergence) rest on these controlled runs and released code rather than any equation, parameter fit, or uniqueness theorem. The dense/sparse opportunity analysis is post-hoc but does not redefine performance metrics or reuse the same data splits for validation in a circular manner. No self-citations are load-bearing for the main results, and no ansatz or renaming of known results is presented as a derivation. This is the normal case of an honest empirical study whose findings can be checked against the public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions that the chosen 18 tasks are representative of ML research problems and that the 12 behavioral metrics validly capture exploration dynamics; no new physical or mathematical entities are introduced and no parameters are fitted to produce the headline performance numbers.

axioms (2)
  • domain assumption The 18 tasks across 10 domains adequately sample the space of ML research problems for strategy comparison.
    Invoked in the benchmark construction section to justify generalization of findings.
  • standard math Process-level metrics such as early convergence and directional focus can be measured independently of final task performance.
    Used when correlating behavioral metrics with final scores.

pith-pipeline@v0.9.0 · 5860 in / 1537 out tokens · 37790 ms · 2026-05-20T14:21:19.495102+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 15 internal anchors

  1. [1]

    Machine bias: There’s software used across the country to predict future criminals

    Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks.ProPublica, May 2016. URL https://www.propublica.org/article/machine-bias-risk-asses sments-in-criminal-sentencing

  2. [2]

    Invariant Risk Minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk mini- mization.arXiv preprint arXiv:1907.02893, 2019

  3. [3]

    Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20

  4. [4]

    Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible too...

  5. [5]

    Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828, 2013

    Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828, 2013

  6. [6]

    Fairlearn: A toolkit for assessing and improving fairness in ai

    Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. Fairlearn: A toolkit for assessing and improving fairness in ai. 2020

  7. [7]

    Dp-instahide: Provably defusing poi- soning and backdoor attacks with differentially private data augmentations.arXiv preprint arXiv:2103.02079, 2021

    Eitan Borgnia, Jonas Geiping, Valeriia Cherepanova, Liam Fowl, Arjun Gupta, Amin Ghiasi, Furong Huang, Micah Goldblum, and Tom Goldstein. Dp-instahide: Provably defusing poi- soning and backdoor attacks with differentially private data augmentations.arXiv preprint arXiv:2103.02079, 2021

  8. [8]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering.arXiv preprint arXiv:2410.07095, 2024

  9. [9]

    Causalml: Python package for causal machine learning, 2020

    Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. Causalml: Python package for causal machine learning, 2020

  10. [10]

    Mars: Modular agent with reflective search for automated ai research.arXiv preprint arXiv:2602.02660, 2026

    Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. Mars: Modular agent with reflective search for automated ai research.arXiv preprint arXiv:2602.02660, 2026

  11. [11]

    Morgan & Claypool Publishers, 2018

    Zhiyuan Chen and Bing Liu.Lifelong machine learning. Morgan & Claypool Publishers, 2018

  12. [12]

    Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery.arXiv preprint arXiv:2410.05080, 2024

  13. [13]

    solo- learn: A library of self-supervised methods for visual representation learning.Journal of Machine Learning Research, 23(56):1–6, 2022

    Victor Guilherme Turrisi Da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo- learn: A library of self-supervised methods for visual representation learning.Journal of Machine Learning Research, 23(56):1–6, 2022

  14. [14]

    The mnist database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

    Li Deng. The mnist database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

  15. [15]

    Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

    Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

  16. [16]

    In search of lost domain generalization.arXiv preprint arXiv:2007.01434, 2020

    Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization.arXiv preprint arXiv:2007.01434, 2020. 10

  17. [17]

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366, 2020

  18. [18]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  19. [19]

    Bayesian nonparametric modeling for causal inference.Journal of Computa- tional and Graphical Statistics, 20(1):217–240, 2011

    Jennifer L Hill. Bayesian nonparametric modeling for causal inference.Journal of Computa- tional and Graphical Statistics, 20(1):217–240, 2011

  20. [20]

    Maximilian Idahl and Zahra Ahmadi

    Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302, 2023

  21. [21]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code. 2025. URL https: //arxiv.org/abs/2502.13138

  22. [22]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  23. [23]

    Dsbench: How far are data science agents from becoming data science experts?arXiv preprint arXiv:2409.07703, 2024

    Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. Dsbench: How far are data science agents from becoming data science experts?arXiv preprint arXiv:2409.07703, 2024

  24. [24]

    autoresearch: Ai agents running research on single-gpu nanochat training automatically.https://github.com/karpathy/autoresearch, 2026

    Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically.https://github.com/karpathy/autoresearch, 2026. GitHub repository

  25. [25]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  26. [26]

    LAB-Bench: Measuring Capabilities of Language Models for Biology Research

    Jon M Laurent, Joseph D Janizek, Michael Ruzo, Michaela M Hinks, Michael J Hammer- ling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D White, and Samuel G Rodriques. Lab-bench: Measuring capabilities of language models for biology research.arXiv preprint arXiv:2407.10362, 2024

  27. [27]

    Abandoning objectives: Evolution through the search for novelty alone.Evolutionary computation, 19(2):189–223, 2011

    Joel Lehman and Kenneth O Stanley. Abandoning objectives: Evolution through the search for novelty alone.Evolutionary computation, 19(2):189–223, 2011

  28. [28]

    Trustworthy ai: From principles to practices.ACM Computing Surveys, 55(9):1–46, 2023

    Bo Li, Peng Qi, Bo Liu, Shuai Di, Jingen Liu, Jiquan Pei, Jinfeng Yi, and Bowen Zhou. Trustworthy ai: From principles to practices.ACM Computing Surveys, 55(9):1–46, 2023

  29. [29]

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha

    Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du. Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

  30. [30]

    Lightly: A python library for self-supervised learning on images

    Lightly-AI. Lightly: A python library for self-supervised learning on images. https://gith ub.com/lightly-ai/lightly, 2025

  31. [31]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  32. [32]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

  33. [33]

    Communication-efficient learning of deep networks from decentralized data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. InArtificial intelligence and statistics, pages 1273–1282. Pmlr, 2017

  34. [34]

    A survey on bias and fairness in machine learning.ACM computing surveys (CSUR), 54(6):1–35, 2021

    Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning.ACM computing surveys (CSUR), 54(6):1–35, 2021. 11

  35. [35]

    Illuminating search spaces by mapping elites

    Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909, 2015

  36. [36]

    Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning.arXiv preprint arXiv:2007.09339, 2020

    Sasi Kumar Murakonda and Reza Shokri. Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning.arXiv preprint arXiv:2007.09339, 2020

  37. [37]

    arXiv preprint arXiv:1807.01069 , year=

    Maria-Irina Nicolae, Mathieu Sinn, Minh Ngoc Tran, Beat Buesser, Ambrish Rawat, Mar- tin Wistuba, Valentina Zantedeschi, Nathalie Baracaldo, Bryant Chen, Heiko Ludwig, et al. Adversarial robustness toolbox v1. 0.0.arXiv preprint arXiv:1807.01069, 2018

  38. [38]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  39. [39]

    Ml-dev-bench: Comparative analysis of ai agents on ml development workflows.arXiv preprint arXiv:2502.00964, 2025

    Harshith Padigela, Chintan Shah, and Dinkar Juyal. Ml-dev-bench: Comparative analysis of ai agents on ml development workflows.arXiv preprint arXiv:2502.00964, 2025

  40. [40]

    The seven tools of causal inference, with reflections on machine learning.Commu- nications of the ACM, 62(3):54–60, 2019

    Judea Pearl. The seven tools of causal inference, with reflections on machine learning.Commu- nications of the ACM, 62(3):54–60, 2019

  41. [41]

    Quality diversity: A new frontier for evolutionary computation.Frontiers in Robotics and AI, 3:40, 2016

    Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. Quality diversity: A new frontier for evolutionary computation.Frontiers in Robotics and AI, 3:40, 2016

  42. [42]

    icarl: Incremental classifier and representation learning

    Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

  43. [43]

    Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

  44. [44]

    Openevolve: an open-source evolutionary coding agent

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent. https://gith ub.com/algorithmicsuperintelligence/openevolve, 2025. GitHub repository

  45. [45]

    Adapting neural networks for the estimation of treatment effects.Advances in neural information processing systems, 32, 2019

    Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects.Advances in neural information processing systems, 32, 2019

  46. [46]

    Easy few-shot learning: ready-to-use code and tutorial notebooks for few-shot image classification.https://github.com/sicara/easy-few-shot-learning, 2024

    Sicara. Easy few-shot learning: ready-to-use code and tutorial notebooks for few-shot image classification.https://github.com/sicara/easy-few-shot-learning, 2024

  47. [47]

    Prototypical networks for few-shot learning

    Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017

  48. [48]

    Fixmatch: Simplifying semi- supervised learning with consistency and confidence.Advances in neural information processing systems, 33:596–608, 2020

    Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raf- fel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi- supervised learning with consistency and confidence.Advances in neural information processing systems, 33:596–608, 2020

  49. [49]

    arXiv preprint arXiv:2505.18705 , year=

    Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025

  50. [50]

    arXiv preprint arXiv:2507.02554 , year=

    Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, et al. Ai research agents for machine learning: Search, exploration, and generalization in mle-bench. arXiv preprint arXiv:2507.02554, 2025

  51. [51]

    Three types of incremental learning.Nature Machine Intelligence, 4:1185–1197, 2022

    Gido M van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning.Nature Machine Intelligence, 4:1185–1197, 2022

  52. [52]

    Vapnik.Statistical Learning Theory

    Vladimir N. Vapnik.Statistical Learning Theory. Wiley-Interscience, New York, 1998. 12

  53. [53]

    Deep hashing network for unsupervised domain adaptation

    Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017

  54. [54]

    Matching networks for one shot learning.Advances in neural information processing systems, 29, 2016

    Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning.Advances in neural information processing systems, 29, 2016

  55. [55]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023

  56. [56]

    Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961, 2022

    Yidong Wang, Hao Chen, Yue Fan, Wang Sun, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961, 2022

  57. [57]

    Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024

    Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024

  58. [58]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025

  59. [59]

    Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021

    Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in pytorch. arXiv preprint arXiv:2109.12298, 2021

  60. [60]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016

  61. [61]

    Barlow twins: Self-supervised learning via redundancy reduction

    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, pages 12310–12320. PMLR, 2021

  62. [62]

    Continual learning through synaptic intelligence

    Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. PMLR, 2017

  63. [63]

    Pfllib: Personalized federated learning algorithm library. arXiv preprint arXiv:2312.04992, 2023

    Jianqing Zhang, Yang Liu, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Jian Cao. Pfllib: Personalized federated learning algorithm library. arXiv preprint arXiv:2312.04992, 2023

  64. [64]

    OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection

    Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, et al. Openood v1.5: Enhanced benchmark for out-of-distribution detection. arXiv preprint arXiv:2306.09301, 2023

  65. [65]

    gcastle: A python toolbox for causal discovery. arXiv preprint arXiv:2111.15155, 2021

    Keli Zhang, Shengyu Zhu, Marcus Kalander, Ignavier Ng, Junjian Ye, Zhitang Chen, and Lujia Pan. gcastle: A python toolbox for causal discovery. arXiv preprint arXiv:2111.15155, 2021

  66. [66]

    Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018

    Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018

  67. [67]

    Pycil: a python toolbox for class-incremental learning

    Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, and De-Chuan Zhan. Pycil: a python toolbox for class-incremental learning, 2023

A Task descriptions

This appendix gives a short description of each of the 18 research tasks in FML-bench, one paragraph per task. For every task we identify the dataset, the baseline algorithm, the agent’s optimization target, and the...