FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3
The pith
A simple greedy hill-climber nearly matches top tree-search performance in AI research agents, while an adaptive strategy that broadens exploration on stagnation outperforms all six tested approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating six agents on the 18-task benchmark reveals that strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents. Analysis ties this pattern to improvement opportunity structure, with greedy search favored when opportunities are dense and tree or evolutionary search favored when they are sparse. An adaptive agent that switches to broader exploration upon detecting improvement stagnation outperforms the other six agents, and process-level metrics show that early convergence and directionally focused exploration are significantly associated with final performance.
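The stagnation-triggered switch can be illustrated with a toy sketch. Nothing below is taken from the paper: the one-dimensional objective, the `patience` threshold, the two step sizes, and every function name are hypothetical, and the paper's agents search over ML code rather than a number line. The sketch only shows the control flow of a greedy hill-climber that widens its proposals after a run of non-improving steps.

```python
import random


def adaptive_search(score, start, steps=200, patience=10,
                    local_step=0.1, broad_step=2.0, seed=0):
    """Greedy hill-climbing that widens its proposals after a stall.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    best_x, best_y = start, score(start)
    since_improvement = 0
    history = []
    for _ in range(steps):
        # Assumed stagnation rule: no new best for `patience` proposals.
        stagnated = since_improvement >= patience
        step = broad_step if stagnated else local_step
        candidate = best_x + rng.uniform(-step, step)
        y = score(candidate)
        if y > best_y:
            # Accept only strict improvements (greedy), then go local again.
            best_x, best_y = candidate, y
            since_improvement = 0
        else:
            since_improvement += 1
        history.append((best_y, "broad" if stagnated else "local"))
    return best_x, best_y, history


def toy_score(x):
    # Hypothetical objective with a low peak near x=0 and a higher one near x=3.
    return max(1.0 - x * x, 2.0 - (x - 3.0) ** 2)


best_x, best_y, history = adaptive_search(toy_score, start=0.5)
```

The `history` list records the best score so far and which proposal mode was active at each step, which is the kind of trace a process-level metric would be computed from.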
What carries the argument
FML-Bench, a controlled benchmark of 18 ML research tasks with 12 process-level behavioral metrics that separates agent search strategy from execution infrastructure.
If this is right
- Greedy search tends to be more effective on tasks where improvement opportunities are dense.
- Tree-search and evolutionary strategies tend to be more effective on tasks where improvement opportunities are sparse.
- Early convergence and directionally focused exploration predict higher final performance across agents.
- Solution diversity and total compute cost show no reliable association with final performance.
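One way to test an association between a process-level metric and final performance is a rank correlation across runs. The summary does not say which statistical test the paper uses, so Spearman correlation is an assumption here, and the numbers below are made-up placeholders rather than results from the benchmark.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder values: steps until the best solution was found, and final score.
steps_to_best = [5, 12, 8, 20, 15]
final_score = [0.9, 0.6, 0.8, 0.3, 0.5]
rho = spearman(steps_to_best, final_score)  # -1.0 for this perfectly inverse toy ordering
```

A negative `rho` on such data would correspond to earlier convergence going with higher final scores; a real analysis would also need a significance test and a correction for the number of metrics examined.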
Where Pith is reading between the lines
- Agents in other scientific domains could adopt similar stagnation-triggered switches between local and global search.
- Task designers might classify new problems by opportunity density to select or combine strategies in advance.
- The benchmark could be extended with tasks that deliberately vary opportunity density to test the adaptive rule more directly.
Load-bearing premise
The chosen 18 tasks and 12 metrics adequately represent the dense versus sparse improvement opportunity structures found in real machine learning research problems.
What would settle it
A follow-up study on a fresh collection of ML research tasks finds that the adaptive agent no longer leads the others or that greedy and tree-search performance gaps disappear.
Original abstract
AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FML-Bench, a controlled benchmark of 18 fundamental ML research tasks across 10 domains. It separates agent search strategy from execution infrastructure and defines 12 process-level behavioral metrics. Experiments on six representative agents show that a simple greedy hill-climber nearly matches the best tree-search agent (both substantially above the others). The authors interpret this pattern as relating to dense versus sparse improvement opportunity structures across tasks; an adaptive agent that switches to broader exploration upon detecting stagnation outperforms the six baselines. Process-level analysis finds early convergence and directionally focused exploration significantly associated with final performance, while solution diversity and compute cost are not.
Significance. If the empirical results hold, the work supplies reproducible, process-aware evidence on strategy effectiveness for AI research agents, showing that added complexity does not automatically improve outcomes and that adaptive switching can be beneficial. The open benchmark and released code enable direct follow-up studies. The process metrics move the field beyond final-score comparisons toward mechanistic understanding of exploration behavior.
major comments (1)
- [Analysis of opportunity structure / adaptive agent construction] Analysis section on opportunity structure and adaptive agent: the classification of tasks into dense versus sparse improvement opportunities is performed post-hoc from the same agent-run improvement frequencies that generate the performance numbers. This creates a risk that the reported strategy-by-task interaction is tautological rather than explanatory. An independent, a-priori measure of opportunity density (defined from task properties before any agent execution) is needed to substantiate the interpretation that underpins the adaptive-agent design and the claim that it generalizes the observed pattern.
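To make the circularity concern concrete, the sketch below computes a post-hoc improvement frequency from a single run's score trajectory and uses it to label a task. The function names, the 0.3 cut-off, and the assumption that higher scores are better are all hypothetical; the paper's exact definition is not given in this summary. The point is that the label is derived from the same trajectories that determine final performance.

```python
def improvement_frequency(scores):
    """Fraction of steps after the first that set a new best score."""
    if len(scores) < 2:
        return 0.0
    best = scores[0]
    improvements = 0
    for s in scores[1:]:
        if s > best:
            best = s
            improvements += 1
    return improvements / (len(scores) - 1)


def label_task(scores, threshold=0.3):
    # `threshold` is an arbitrary illustrative cut-off, not a value from the paper.
    return "dense" if improvement_frequency(scores) >= threshold else "sparse"


# Hypothetical trajectory: three of the five later steps set a new best.
trajectory = [0.50, 0.52, 0.51, 0.55, 0.55, 0.60]
freq = improvement_frequency(trajectory)
label = label_task(trajectory)
```

An a-priori measure, as the referee requests, would have to take task properties as input instead of `scores`.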
minor comments (3)
- [Task description] Table 1 or task-description section: confirm whether the 18 tasks are evenly distributed across the stated 10 domains or whether some domains contain multiple related tasks; this affects claims about breadth.
- [Metrics definition] Process-metrics definitions: provide explicit formulas or pseudocode for the 12 metrics (especially the stagnation-detection rule used by the adaptive agent) so that future work can replicate the exact thresholds.
- [Results] Results figures: add statistical significance markers or confidence intervals to the performance bar charts so readers can judge whether the greedy–tree-search parity and the adaptive-agent gains are reliable.
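The confidence intervals requested in the last comment could be produced with a percentile bootstrap over per-run scores. This is a minimal sketch under assumptions: the scores are placeholders, the resample count and seed are arbitrary, and the paper may use a different procedure.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-run scores for one agent on one task (not real results).
run_scores = [0.61, 0.58, 0.66, 0.63, 0.60]
lo, hi = bootstrap_ci(run_scores)
```

If the intervals for the greedy and tree-search agents overlap heavily, the reported near-parity would be hard to distinguish from noise; with only a handful of runs per agent, the percentile bootstrap is itself a rough approximation.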
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful feedback on our manuscript. We address the major comment below and describe the revisions we plan to make.
- Referee: [Analysis of opportunity structure / adaptive agent construction] Analysis section on opportunity structure and adaptive agent: the classification of tasks into dense versus sparse improvement opportunities is performed post-hoc from the same agent-run improvement frequencies that generate the performance numbers. This creates a risk that the reported strategy-by-task interaction is tautological rather than explanatory. An independent, a-priori measure of opportunity density (defined from task properties before any agent execution) is needed to substantiate the interpretation that underpins the adaptive-agent design and the claim that it generalizes the observed pattern.
  Authors: We agree that the current categorization of tasks into dense versus sparse improvement opportunities relies on post-hoc analysis of improvement frequencies observed during agent runs, which introduces a potential circularity with the performance results. To address this limitation, we will revise the manuscript to define and compute an independent a-priori measure of opportunity density based solely on intrinsic task properties prior to any agent execution. This measure will incorporate factors such as the dimensionality of the hyperparameter space, the number of modifiable components in the task formulation, and indicators of landscape structure derivable from the task description itself. Tasks will then be reclassified using this pre-defined metric, and we will re-examine the strategy-by-task performance patterns to verify consistency with the dense/sparse distinction. The adaptive agent design and associated claims will be updated to reference this independent categorization, thereby strengthening the explanatory power and generalizability of the results.
  Revision: yes
Circularity Check
Empirical benchmark evaluation with direct comparisons; no derivation reduces to fitted inputs or self-citation
The paper introduces FML-Bench as an experimental platform with 18 tasks and 12 process metrics, then reports head-to-head performance of six agents plus one adaptive variant constructed from observed patterns. All central claims (greedy nearly matching tree search, adaptive outperforming others, associations with early convergence) rest on these controlled runs and released code rather than any equation, parameter fit, or uniqueness theorem. The dense/sparse opportunity analysis is post-hoc but does not redefine performance metrics or reuse the same data splits for validation in a circular manner. No self-citations are load-bearing for the main results, and no ansatz or renaming of known results is presented as a derivation. This is the normal case of an honest empirical study whose findings can be checked against the public benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 18 tasks across 10 domains adequately sample the space of ML research problems for strategy comparison.
- standard math: Process-level metrics such as early convergence and directional focus can be measured independently of final task performance.
A Task descriptions
This appendix gives a short description of each of the 18 research tasks in FML-bench, one paragraph per task. For every task we identify the dataset, the baseline algorithm, the agent’s optimization target, and the...