FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

Qiran Zou; Hou Hei Lam; Wenhao Zhao; Tingting Chen; Yiming Tang; Samson Yu; Yingtao Zhu; Srinivas Anumasa; Zufeng Zhang; Tianyi Zhang; Chang Liu; Zhengyao Jiang; Anirudh Goyal; Dianbo Liu

arxiv: 2605.17373 · v1 · pith:TUCWOVSSnew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: AI research agents, search strategies, greedy hill-climbing, tree search, adaptive exploration, process metrics, machine learning benchmarks, opportunity structure

The pith

A simple greedy hill-climber nearly matches top tree-search performance in AI research agents, while an adaptive strategy that broadens exploration on stagnation outperforms all six tested approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FML-Bench to isolate the effects of search strategy from execution details when comparing AI agents that automate machine learning research. It runs six representative agents on 18 fundamental tasks spanning 10 domains and tracks 12 process-level metrics such as convergence speed and exploration focus. Results show that added strategy complexity does not reliably improve outcomes: a basic greedy approach performs almost as well as the strongest tree-search method, and both clearly beat the other agents. An adaptive agent that detects stagnation and switches to wider search beats every fixed strategy. Process metrics further link early focused progress to better final results, while solution variety and raw compute spend show no such link.
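As a rough illustration of what a convergence-speed metric could look like, the sketch below computes a best-so-far curve and the first step at which a run reaches a given fraction of its final best. The function names, the 0.9 default, and the toy scores are assumptions made for this example; the summary does not give the paper's actual metric definitions.

```python
from itertools import accumulate
from typing import List, Optional


def best_so_far(scores: List[float]) -> List[float]:
    """Running maximum of per-step validation improvements."""
    return list(accumulate(scores, max))


def steps_to_fraction(scores: List[float], fraction: float = 0.9) -> Optional[int]:
    """First 1-indexed step whose running best reaches `fraction` of the final best.

    Returns None when the run is empty or never improves above zero.
    The 0.9 default is a hypothetical threshold, not a value from the paper.
    """
    curve = best_so_far(scores)
    if not curve or curve[-1] <= 0:
        return None
    target = fraction * curve[-1]
    for step, value in enumerate(curve, start=1):
        if value >= target:
            return step
    return None


# Toy per-step improvements (illustrative only, not benchmark data).
run = [0.0, 0.02, 0.01, 0.05, 0.05, 0.04, 0.06]
print(best_so_far(run))
print(steps_to_fraction(run, fraction=0.8))
```

A smaller step count means the run reached most of its final gain earlier, which is one plausible way to operationalize "early convergence".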

Core claim

Evaluating six agents on the 18-task benchmark reveals that strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents. Analysis ties this pattern to improvement opportunity structure, with greedy search favored when opportunities are dense and tree or evolutionary search favored when they are sparse. An adaptive agent that switches to broader exploration upon detecting improvement stagnation outperforms the other six agents, and process-level metrics show that early convergence and directionally focused exploration are significantly associated with final performance.
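The stagnation-triggered switch can be sketched on a toy problem. In the example below, a candidate is a single float and a proposal is a random perturbation, standing in for an agent's proposed code modification. The `patience`, `narrow`, and `wide` parameters and the function name are hypothetical; the paper's actual detection rule and thresholds are not stated in this summary.

```python
import random
from typing import Callable, List, Tuple


def adaptive_search(
    evaluate: Callable[[float], float],
    start: float,
    steps: int = 100,
    patience: int = 5,    # hypothetical: non-improving steps before broadening
    narrow: float = 0.1,  # hypothetical: proposal radius in greedy mode
    wide: float = 2.0,    # hypothetical: proposal radius in broad mode
    seed: int = 0,
) -> Tuple[float, float, List[str]]:
    """Greedy hill-climbing that widens its proposals while improvement stagnates."""
    rng = random.Random(seed)
    best_x, best_score = start, evaluate(start)
    since_improved = 0
    modes: List[str] = []
    for _ in range(steps):
        stagnating = since_improved >= patience
        modes.append("broad" if stagnating else "greedy")
        radius = wide if stagnating else narrow
        candidate = best_x + rng.uniform(-radius, radius)
        score = evaluate(candidate)
        if score > best_score:
            # Accept only strict improvements, then return to greedy mode.
            best_x, best_score = candidate, score
            since_improved = 0
        else:
            since_improved += 1
    return best_x, best_score, modes


# Toy objective with its maximum at x = 3.
x, score, modes = adaptive_search(lambda v: -(v - 3.0) ** 2, start=0.0)
print(round(x, 3), round(score, 5), modes.count("broad"))
```

The design choice worth noting is that the search never abandons the incumbent: broad mode only changes how far proposals may stray from the current best, so a fixed greedy strategy is recovered by setting `patience` above `steps`.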

What carries the argument

FML-Bench, a controlled benchmark of 18 ML research tasks with 12 process-level behavioral metrics that separates agent search strategy from execution infrastructure.

If this is right

Greedy search tends to be more effective on tasks where improvement opportunities are dense.
Tree-search and evolutionary strategies tend to be more effective on tasks where improvement opportunities are sparse.
Early convergence and directionally focused exploration predict higher final performance across agents.
Solution diversity and total compute cost show no reliable association with final performance.
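One way to check an association like those listed above is a rank correlation between a per-run process metric and final improvement. The sketch below implements Spearman's coefficient with average ranks for ties, using only the standard library. Choosing Spearman is an assumption (the summary does not say which test the paper uses), and the numbers are placeholders, not results.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Placeholder values: steps needed to converge vs. final improvement per run.
steps_to_converge = [12, 30, 45, 20, 60]
final_improvement = [0.09, 0.05, 0.03, 0.07, 0.01]
print(spearman(steps_to_converge, final_improvement))
```

A coefficient near -1 on real data would mean that runs converging in fewer steps tend to finish with larger improvements; a significance test would still be needed before calling the association reliable.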

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents in other scientific domains could adopt similar stagnation-triggered switches between local and global search.
Task designers might classify new problems by opportunity density to select or combine strategies in advance.
The benchmark could be extended with tasks that deliberately vary opportunity density to test the adaptive rule more directly.

Load-bearing premise

The chosen 18 tasks and 12 metrics adequately represent the dense versus sparse improvement opportunity structures found in real machine learning research problems.

What would settle it

A follow-up study on a fresh collection of ML research tasks finds that the adaptive agent no longer leads the others or that greedy and tree-search performance gaps disappear.

Figures

Figures reproduced from arXiv: 2605.17373 by Anirudh Goyal, Chang Liu, Dianbo Liu, Hou Hei Lam, Qiran Zou, Samson Yu, Srinivas Anumasa, Tianyi Zhang, Tingting Chen, Wenhao Zhao, Yiming Tang, Yingtao Zhu, Zhengyao Jiang, Zufeng Zhang.

**Figure 1.** Comparison of the six AI research agents on FML-bench. Left: per-agent mean normalized test improvement (left axis) and average pairwise win-rate (right axis), agents ranked by mean improvement. Right: per-agent fingerprint over six process-level axes capturing convergence efficiency, exploration geometry, and cost frugality (higher is better on every axis).

**Figure 2.** The FML-bench evaluation pipeline. Left: the task specification fed to the agent. Center: the agent iterates a propose, modify, execute loop; only the decision of what to try next is governed by the agent’s own strategy (unlocked icon), while codebase modification and experiment execution (locked icons) are shared framework infrastructure. Right: the framework evaluates the best-validated codebase on a hel…

**Figure 3.** Mean convergence curves across 18 research tasks. Each line is per-agent mean best-so-far validation improvement, averaged over 18 tasks × 3 rounds, at each of the 100 optimization steps.

**Figure 4.** Search-regime crossover. Per-agent mean normalized test improvement on the high and low opportunity-density partitions; error bars are the cross-round standard deviation. Autoresearch leads the high-density partition but falls to sixth (of seven) on low-density; AdaptiveSearch ranks in the top two on both partitions (second on high-density, first on low-density), confirming that adaptive switching is robus…

**Figure 5.** Autoresearch’s per-task improvement is the most polarized of the six agents. Left: per-agent improvement distribution across the 18 tasks (3-round mean per cell), agents sorted by std. Right: per-task rank distribution (rank 1 best, rank 6 worst). Autoresearch attains the largest improvement std and the most extreme rank distribution.

**Figure 6.** Raw quality comparison across backbone LLMs. Gemini 3.1 Pro is the most consistent …

**Figure 7.** Cost–quality trade-off across backbone LLMs. GPT-5.4 occupies the low-cost regime while …

**Figure 8.** Pooled modification-type distribution across three runs for each agent. All agents are …

**Figure 9.** Failure-type rate for each agent, measured as the percentage of all trials ending in each …
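Figure 1 reports an average pairwise win-rate alongside mean improvement. A minimal sketch of how such a statistic could be computed from a per-agent, per-task score table follows. Counting ties as half a win and averaging over opponents are assumptions; the captions do not specify the paper's exact definition, and the agent names and scores are placeholders.

```python
from typing import Dict, List


def average_pairwise_win_rate(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """For each agent, the mean over opponents of the fraction of tasks it wins.

    `scores[agent][t]` is that agent's score on task t; every agent must cover
    the same tasks in the same order. Ties count as half a win (an assumption).
    Requires at least two agents.
    """
    agents = list(scores)
    result: Dict[str, float] = {}
    for a in agents:
        rates = []
        for b in agents:
            if a == b:
                continue
            wins = sum(
                1.0 if sa > sb else 0.5 if sa == sb else 0.0
                for sa, sb in zip(scores[a], scores[b])
            )
            rates.append(wins / len(scores[a]))
        result[a] = sum(rates) / len(rates)
    return result


# Placeholder scores for three hypothetical agents on two tasks.
toy = {"agent_a": [3.0, 3.0], "agent_b": [2.0, 2.0], "agent_c": [1.0, 1.0]}
print(average_pairwise_win_rate(toy))
```

Win-rate complements mean improvement because it is insensitive to the size of a win, so one very large gain on a single task cannot dominate the ranking.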

Original abstract

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FML-bench cleanly compares search strategies for AI research agents and finds greedy nearly matching tree search while an adaptive switcher wins, though the dense/sparse opportunity explanation rests on post-hoc task classification.

The main thing to know is that this paper builds a benchmark that holds execution infrastructure fixed, so performance differences can be pinned on search strategy. The results show a simple greedy hill-climber performing close to the top tree-search agent across the tasks, and an adaptive agent that switches to wider exploration on stagnation comes out ahead of the six baselines tested.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces FML-Bench, a controlled benchmark of 18 fundamental ML research tasks across 10 domains. It separates agent search strategy from execution infrastructure and defines 12 process-level behavioral metrics. Experiments on six representative agents show that a simple greedy hill-climber nearly matches the best tree-search agent (both substantially above the others). The authors interpret this pattern as relating to dense versus sparse improvement opportunity structures across tasks; an adaptive agent that switches to broader exploration upon detecting stagnation outperforms the six baselines. Process-level analysis finds early convergence and directionally focused exploration significantly associated with final performance, while solution diversity and compute cost are not.

Significance. If the empirical results hold, the work supplies reproducible, process-aware evidence on strategy effectiveness for AI research agents, showing that added complexity does not automatically improve outcomes and that adaptive switching can be beneficial. The open benchmark and released code enable direct follow-up studies. The process metrics move the field beyond final-score comparisons toward mechanistic understanding of exploration behavior.

major comments (1)

[Analysis of opportunity structure / adaptive agent construction] Analysis section on opportunity structure and adaptive agent: the classification of tasks into dense versus sparse improvement opportunities is performed post-hoc from the same agent-run improvement frequencies that generate the performance numbers. This creates a risk that the reported strategy-by-task interaction is tautological rather than explanatory. An independent, a-priori measure of opportunity density (defined from task properties before any agent execution) is needed to substantiate the interpretation that underpins the adaptive-agent design and the claim that it generalizes the observed pattern.
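To make the circularity concern concrete, the sketch below shows one way a post-hoc partition could be derived from run logs: compute each task's improvement frequency, then split tasks at the median. The function names, the median cut, and the toy logs are assumptions for illustration; the paper's actual classification rule is not given in this report.

```python
from statistics import median
from typing import Dict, List


def improvement_frequency(scores: List[float]) -> float:
    """Fraction of steps after the first that strictly beat the running best."""
    if len(scores) < 2:
        return 0.0
    best, hits = scores[0], 0
    for s in scores[1:]:
        if s > best:
            best, hits = s, hits + 1
    return hits / (len(scores) - 1)


def partition_by_density(runs: Dict[str, List[float]]) -> Dict[str, str]:
    """Label each task 'high' or 'low' density by a median split (hypothetical rule)."""
    freq = {task: improvement_frequency(r) for task, r in runs.items()}
    cut = median(freq.values())
    return {task: ("high" if f > cut else "low") for task, f in freq.items()}


# Toy per-task score logs (illustrative only).
logs = {
    "task_1": [0.0, 1.0, 2.0, 3.0],
    "task_2": [1.0, 1.0, 1.0],
    "task_3": [0.0, 1.0, 1.0, 1.0],
}
print(partition_by_density(logs))
```

Because the labels are computed from the same runs whose performance they are later used to explain, an agent that improves often on a task helps push that task into the "high" partition, which is the dependence the referee asks the authors to break with an a-priori measure.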

minor comments (3)

[Task description] Table 1 or task-description section: confirm whether the 18 tasks are evenly distributed across the stated 10 domains or whether some domains contain multiple related tasks; this affects claims about breadth.
[Metrics definition] Process-metrics definitions: provide explicit formulas or pseudocode for the 12 metrics (especially the stagnation-detection rule used by the adaptive agent) so that future work can replicate the exact thresholds.
[Results] Results figures: add statistical significance markers or confidence intervals to the performance bar charts so readers can judge whether the greedy–tree-search parity and the adaptive-agent gains are reliable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful feedback on our manuscript. We address the major comment below and describe the revisions we plan to make.

Referee: [Analysis of opportunity structure / adaptive agent construction] Analysis section on opportunity structure and adaptive agent: the classification of tasks into dense versus sparse improvement opportunities is performed post-hoc from the same agent-run improvement frequencies that generate the performance numbers. This creates a risk that the reported strategy-by-task interaction is tautological rather than explanatory. An independent, a-priori measure of opportunity density (defined from task properties before any agent execution) is needed to substantiate the interpretation that underpins the adaptive-agent design and the claim that it generalizes the observed pattern.

Authors: We agree that the current categorization of tasks into dense versus sparse improvement opportunities relies on post-hoc analysis of improvement frequencies observed during agent runs, which introduces a potential circularity with the performance results. To address this limitation, we will revise the manuscript to define and compute an independent a-priori measure of opportunity density based solely on intrinsic task properties prior to any agent execution. This measure will incorporate factors such as the dimensionality of the hyperparameter space, the number of modifiable components in the task formulation, and indicators of landscape structure derivable from the task description itself. Tasks will then be reclassified using this pre-defined metric, and we will re-examine the strategy-by-task performance patterns to verify consistency with the dense/sparse distinction. The adaptive agent design and associated claims will be updated to reference this independent categorization, thereby strengthening the explanatory power and generalizability of the results.

Revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with direct comparisons; no derivation reduces to fitted inputs or self-citation

The paper introduces FML-Bench as an experimental platform with 18 tasks and 12 process metrics, then reports head-to-head performance of six agents plus one adaptive variant constructed from observed patterns. All central claims (greedy nearly matching tree search, adaptive outperforming others, associations with early convergence) rest on these controlled runs and released code rather than any equation, parameter fit, or uniqueness theorem. The dense/sparse opportunity analysis is post-hoc but does not redefine performance metrics or reuse the same data splits for validation in a circular manner. No self-citations are load-bearing for the main results, and no ansatz or renaming of known results is presented as a derivation. This is the normal case of an honest empirical study whose findings can be checked against the public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions that the chosen 18 tasks are representative of ML research problems and that the 12 behavioral metrics validly capture exploration dynamics; no new physical or mathematical entities are introduced and no parameters are fitted to produce the headline performance numbers.

axioms (2)

domain assumption: The 18 tasks across 10 domains adequately sample the space of ML research problems for strategy comparison.
Invoked in the benchmark construction section to justify generalization of findings.
standard math: Process-level metrics such as early convergence and directional focus can be measured independently of final task performance.
Used when correlating behavioral metrics with final scores.

pith-pipeline@v0.9.0 · 5860 in / 1537 out tokens · 37790 ms · 2026-05-20T14:21:19.495102+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 15 internal anchors

[1]

Machine bias: There’s software used across the country to predict future criminals

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks.ProPublica, May 2016. URL https://www.propublica.org/article/machine-bias-risk-asses sments-in-criminal-sentencing

work page 2016
[2]

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk mini- mization.arXiv preprint arXiv:1907.02893, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[3]

Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20

work page doi:10.24432/c5xw20 1996
[4]

Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible too...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828, 2013

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828, 2013

work page 2013
[6]

Fairlearn: A toolkit for assessing and improving fairness in ai

Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. Fairlearn: A toolkit for assessing and improving fairness in ai. 2020

work page 2020
[7]

Dp-instahide: Provably defusing poi- soning and backdoor attacks with differentially private data augmentations.arXiv preprint arXiv:2103.02079, 2021

Eitan Borgnia, Jonas Geiping, Valeriia Cherepanova, Liam Fowl, Arjun Gupta, Amin Ghiasi, Furong Huang, Micah Goldblum, and Tom Goldstein. Dp-instahide: Provably defusing poi- soning and backdoor attacks with differentially private data augmentations.arXiv preprint arXiv:2103.02079, 2021

work page arXiv 2021
[8]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering.arXiv preprint arXiv:2410.07095, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Causalml: Python package for causal machine learning, 2020

Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. Causalml: Python package for causal machine learning, 2020

work page 2020
[10]

Mars: Modular agent with reflective search for automated ai research.arXiv preprint arXiv:2602.02660, 2026

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, and Jinsung Yoon. Mars: Modular agent with reflective search for automated ai research.arXiv preprint arXiv:2602.02660, 2026

work page internal anchor Pith review arXiv 2026
[11]

Morgan & Claypool Publishers, 2018

Zhiyuan Chen and Bing Liu.Lifelong machine learning. Morgan & Claypool Publishers, 2018

work page 2018
[12]

Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery.arXiv preprint arXiv:2410.05080, 2024

work page arXiv 2024
[13]

solo- learn: A library of self-supervised methods for visual representation learning.Journal of Machine Learning Research, 23(56):1–6, 2022

Victor Guilherme Turrisi Da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo- learn: A library of self-supervised methods for visual representation learning.Journal of Machine Learning Research, 23(56):1–6, 2022

work page 2022
[14]

The mnist database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

Li Deng. The mnist database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 29(6):141–142, 2012

work page 2012
[15]

Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified bench- marking of methods and metrics.arXiv preprint arXiv:2506.12618, 2025

work page arXiv 2025
[16]

In search of lost domain generalization.arXiv preprint arXiv:2007.01434, 2020

Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization.arXiv preprint arXiv:2007.01434, 2020. 10

work page arXiv 2007
[17]

GraphCodeBERT: Pre-training Code Representations with Data Flow

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[18]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

work page 2020
[19]

Bayesian nonparametric modeling for causal inference.Journal of Computa- tional and Graphical Statistics, 20(1):217–240, 2011

Jennifer L Hill. Bayesian nonparametric modeling for causal inference.Journal of Computa- tional and Graphical Statistics, 20(1):217–240, 2011

work page 2011
[20]

Maximilian Idahl and Zahra Ahmadi

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation.arXiv preprint arXiv:2310.03302, 2023

work page arXiv 2023
[21]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code. 2025. URL https: //arxiv.org/abs/2502.13138

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Dsbench: How far are data science agents from becoming data science experts?arXiv preprint arXiv:2409.07703, 2024

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. Dsbench: How far are data science agents from becoming data science experts?arXiv preprint arXiv:2409.07703, 2024

work page arXiv 2024
[24]

autoresearch: Ai agents running research on single-gpu nanochat training automatically.https://github.com/karpathy/autoresearch, 2026

Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically.https://github.com/karpathy/autoresearch, 2026. GitHub repository

work page 2026
[25]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[26]

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M Laurent, Joseph D Janizek, Michael Ruzo, Michaela M Hinks, Michael J Hammer- ling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D White, and Samuel G Rodriques. Lab-bench: Measuring capabilities of language models for biology research.arXiv preprint arXiv:2407.10362, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Abandoning objectives: Evolution through the search for novelty alone.Evolutionary computation, 19(2):189–223, 2011

Joel Lehman and Kenneth O Stanley. Abandoning objectives: Evolution through the search for novelty alone.Evolutionary computation, 19(2):189–223, 2011

work page 2011
[28]

Trustworthy ai: From principles to practices.ACM Computing Surveys, 55(9):1–46, 2023

Bo Li, Peng Qi, Bo Liu, Shuai Di, Jingen Liu, Jiquan Pei, Jinfeng Yi, and Bowen Zhou. Trustworthy ai: From principles to practices.ACM Computing Surveys, 55(9):1–46, 2023

work page 2023
[29]

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha

Ruochen Li, Teerth Patel, Qingyun Wang, and Xinya Du. Mlr-copilot: Autonomous machine learning research based on large language models agents.arXiv preprint arXiv:2408.14033, 2024

work page arXiv 2024
[30]

Lightly: A python library for self-supervised learning on images

Lightly-AI. Lightly: A python library for self-supervised learning on images. https://gith ub.com/lightly-ai/lightly, 2025

work page 2025
[31]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Communication-efficient learning of deep networks from decentralized data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. InArtificial intelligence and statistics, pages 1273–1282. Pmlr, 2017

work page 2017
[34]

A survey on bias and fairness in machine learning.ACM computing surveys (CSUR), 54(6):1–35, 2021

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning.ACM computing surveys (CSUR), 54(6):1–35, 2021. 11

work page 2021
[35]

Illuminating search spaces by mapping elites

Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[36]

Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning.arXiv preprint arXiv:2007.09339, 2020

Sasi Kumar Murakonda and Reza Shokri. Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning.arXiv preprint arXiv:2007.09339, 2020

work page arXiv 2007
[37]

arXiv preprint arXiv:1807.01069 , year=

Maria-Irina Nicolae, Mathieu Sinn, Minh Ngoc Tran, Beat Buesser, Ambrish Rawat, Mar- tin Wistuba, Valentina Zantedeschi, Nathalie Baracaldo, Bryant Chen, Heiko Ludwig, et al. Adversarial robustness toolbox v1. 0.0.arXiv preprint arXiv:1807.01069, 2018

work page arXiv 2018
[38]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Ml-dev-bench: Comparative analysis of ai agents on ml development workflows.arXiv preprint arXiv:2502.00964, 2025

Harshith Padigela, Chintan Shah, and Dinkar Juyal. Ml-dev-bench: Comparative analysis of ai agents on ml development workflows.arXiv preprint arXiv:2502.00964, 2025

work page arXiv 2025
[40]

The seven tools of causal inference, with reflections on machine learning.Commu- nications of the ACM, 62(3):54–60, 2019

Judea Pearl. The seven tools of causal inference, with reflections on machine learning.Commu- nications of the ACM, 62(3):54–60, 2019

work page 2019
[41]

Quality diversity: A new frontier for evolutionary computation.Frontiers in Robotics and AI, 3:40, 2016

Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. Quality diversity: A new frontier for evolutionary computation.Frontiers in Robotics and AI, 3:40, 2016

work page 2016
[42]

icarl: Incremental classifier and representation learning

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

work page 2001
[43]

Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

work page 2025
[44]

Openevolve: an open-source evolutionary coding agent

Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent. https://gith ub.com/algorithmicsuperintelligence/openevolve, 2025. GitHub repository

work page 2025
[45]

Adapting neural networks for the estimation of treatment effects.Advances in neural information processing systems, 32, 2019

Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects.Advances in neural information processing systems, 32, 2019

work page 2019
[46]

Easy few-shot learning: ready-to-use code and tutorial notebooks for few-shot image classification.https://github.com/sicara/easy-few-shot-learning, 2024

Sicara. Easy few-shot learning: ready-to-use code and tutorial notebooks for few-shot image classification.https://github.com/sicara/easy-few-shot-learning, 2024

work page 2024
[47]

Prototypical networks for few-shot learning

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017

work page 2017
[48]

Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020
[49]

Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation. arXiv preprint arXiv:2505.18705, 2025
[50]

Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, et al. Ai research agents for machine learning: Search, exploration, and generalization in mle-bench. arXiv preprint arXiv:2507.02554, 2025
[51]

Gido M van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4:1185–1197, 2022
[52]

Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998
[53]

Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017
[54]

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016
[55]

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023
[56]

Yidong Wang, Hao Chen, Yue Fan, Wang Sun, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961, 2022
[57]

Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, et al. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024
[58]

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025
[59]

Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in pytorch. arXiv preprint arXiv:2109.12298, 2021
[60]

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016
[61]

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, pages 12310–12320. PMLR, 2021
[62]

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. PMLR, 2017
[63]

Jianqing Zhang, Yang Liu, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Jian Cao. Pfllib: Personalized federated learning algorithm library. arXiv preprint arXiv:2312.04992, 2023
[64]

Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, et al. Openood v1.5: Enhanced benchmark for out-of-distribution detection. arXiv preprint arXiv:2306.09301, 2023
[65]

Keli Zhang, Shengyu Zhu, Marcus Kalander, Ignavier Ng, Junjian Ye, Zhitang Chen, and Lujia Pan. gcastle: A python toolbox for causal discovery. arXiv preprint arXiv:2111.15155, 2021
[66]

Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018
[67]

Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, and De-Chuan Zhan. Pycil: a python toolbox for class-incremental learning, 2023

A Task descriptions

This appendix gives a short description of each of the 18 research tasks in FML-bench, one paragraph per task. For every task we identify the dataset, the baseline algorithm, the agent’s optimization target, and the...
