AIBuildAI: An AI Agent for Automatically Building AI Models

Li Zhang; Peijia Qin; Pengtao Xie; Qi Cao; Ruiyi Zhang

arxiv: 2604.14455 · v1 · submitted 2026-04-15 · 💻 cs.AI

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

Pith reviewed 2026-05-10 12:45 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AI agent, automated model building, hierarchical agents, LLM agents, MLE-Bench, AutoML, model development

The pith

A hierarchical AI agent system automatically builds complete models from task descriptions and data, achieving first place on a benchmark of realistic development tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AIBuildAI as an agent that receives only a task description and training data, then produces a working AI model through fully automated steps. It structures the process with one manager agent directing three sub-agents that separately handle modeling choices, code creation and fixes, and training adjustments. This approach extends past conventional AutoML tools, which operate only inside fixed model families and hyperparameter spaces. The system is tested on MLE-Bench, a collection of Kaggle-style problems covering image, text, time-series, and tabular data, where it records the top score and is reported to match the results of experienced human engineers. The central demonstration is that coordinated LLM agents can carry out the entire model-development lifecycle without ongoing human guidance.

Core claim

AIBuildAI uses a manager agent to coordinate a designer sub-agent for choosing modeling strategies, a coder sub-agent for writing and debugging code, and a tuner sub-agent for training and performance refinement. Each sub-agent is an LLM-based system that performs multi-step reasoning and tool use. On the MLE-Bench benchmark of diverse real-world tasks, this architecture delivers a 63.1 percent medal rate, the highest among tested methods and comparable to the output of skilled human practitioners.

What carries the argument

Hierarchical agent architecture in which a manager coordinates three specialized LLM agents (designer, coder, tuner) that together execute architecture selection, code implementation, debugging, and optimization.
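
The coordination pattern described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify any interfaces, so the `Workspace` class, the three stub functions, the `target` threshold, and the `max_rounds` budget are all hypothetical, and each sub-agent is replaced by a trivial stand-in for what would be an LLM call with tool use.

```python
from dataclasses import dataclass, field
from typing import List

# All names here are hypothetical; the paper's abstract does not define interfaces.


@dataclass
class Workspace:
    """Shared state handed between agents (an assumed mechanism)."""
    task: str
    plan: str = ""
    code: str = ""
    score: float = 0.0
    log: List[str] = field(default_factory=list)


def designer(ws: Workspace) -> None:
    # Stand-in for an LLM call that chooses a modeling strategy.
    ws.plan = f"baseline model for: {ws.task}"
    ws.log.append("designer")


def coder(ws: Workspace) -> None:
    # Stand-in for an LLM call that writes and debugs training code.
    ws.code = f"# implements: {ws.plan}"
    ws.log.append("coder")


def tuner(ws: Workspace) -> None:
    # Stand-in for training and refinement; each call bumps a toy score.
    ws.score = round(ws.score + 0.25, 2)
    ws.log.append("tuner")


def manager(task: str, target: float = 0.5, max_rounds: int = 4) -> Workspace:
    """Delegate to the sub-agents in order, re-running the tuner until the
    toy score reaches the target or the round budget is exhausted."""
    ws = Workspace(task=task)
    designer(ws)
    coder(ws)
    for _ in range(max_rounds):
        tuner(ws)
        if ws.score >= target:
            break
    return ws


result = manager("classify images")
print(result.log, result.score)
```

A real system would presumably let the manager loop back to the designer or coder when tuning stalls; the fixed designer, coder, tuner order here is a simplification chosen to keep the sketch short.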

If this is right

End-to-end automation becomes feasible for the full AI model development process from specification to deployable artifact.
Performance on realistic tasks reaches levels previously associated only with experienced human engineers.
The approach surpasses existing AutoML systems by handling open-ended architecture design and implementation steps.
AI model creation could become accessible with far less specialized expertise than is currently required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same coordination pattern might transfer to other multi-stage engineering workflows that currently demand teams of specialists.
Further reliability gains in the underlying language models could reduce the frequency of failures on harder or less common data modalities.
Combining the agent with existing code repositories or external APIs might shorten the remaining manual review steps even more.

Load-bearing premise

LLM-based agents can execute long sequences of architecture design, coding, debugging, and performance tuning across different data types without repeated human corrections or breakdowns.

What would settle it

A follow-up evaluation on additional MLE-Bench tasks or similar problems in which AIBuildAI produces no competitive model or requires substantial human fixes to reach working performance.

Original abstract

AI models underpin modern intelligent systems, driving advances across science, medicine, finance, and technology. Yet developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation. Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise. To address this gap, we introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data. AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization. Each sub-agent is itself a large language model (LLM) based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches. We evaluate AIBuildAI on MLE-Bench, a benchmark of realistic Kaggle-style AI development tasks spanning visual, textual, time-series and tabular modalities. AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers. These results demonstrate that hierarchical agent systems can automate the full AI model development process from task specification to deployable model, suggesting a pathway toward broadly accessible AI development with minimal human intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AIBuildAI, a hierarchical LLM-based agent system in which a manager agent coordinates three specialized sub-agents (designer for modeling strategy, coder for implementation/debugging, and tuner for optimization) to automate the full AI model development pipeline from task description and data to a deployable model. It evaluates the system on MLE-Bench, a collection of realistic Kaggle-style tasks across visual, textual, time-series, and tabular modalities, and claims a first-place ranking with a 63.1% medal rate that outperforms existing AutoML and agent baselines while matching experienced human engineers.

Significance. If the performance claims can be substantiated with complete experimental details, this would constitute a notable advance in automated machine learning. The hierarchical multi-agent design extends beyond conventional AutoML (limited to hyperparameter search within fixed spaces) by attempting end-to-end automation of architecture design, coding, debugging, and tuning. Successful validation would support the broader hypothesis that LLM agents can reliably handle complex, multi-step engineering workflows across modalities with minimal human oversight.

major comments (3)

[Abstract and Experiments section] The headline result of a 63.1% medal rate and first-place ranking on MLE-Bench is presented without any description of the base LLMs powering the manager/designer/coder/tuner agents, the number of tasks attempted versus completed, the number of independent trials per task, retry budgets, failure-handling protocols, or the precise definition of a 'medal' used by the benchmark. These omissions make it impossible to determine whether the reported superiority is attributable to the hierarchical architecture or to unreported implementation choices, directly undermining the central empirical claim.
[Method section] The paper asserts that the sub-agents enable 'end-to-end automation ... without human intervention,' yet provides no concrete specification of inter-agent communication protocols, tool-use interfaces, state sharing, or error-recovery mechanisms. Without these details, the weakest assumption (that LLM agents can reliably execute the full multi-step pipeline across diverse modalities) cannot be evaluated, leaving the architectural contribution untestable.
[Experiments section] The claim that AIBuildAI 'outperforms all existing baseline methods' is unsupported by any description of baseline re-implementations, statistical significance tests, variance across runs, or ablation studies isolating the contribution of the manager or individual sub-agents. This absence prevents assessment of whether the medal-rate advantage is robust or confounded by differences in underlying model capabilities.
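
To make the request for run-to-run variance concrete, the sketch below shows one way a medal rate could be reported as a mean with a standard error across independent runs. It assumes that a medal rate is the fraction of tasks on which a run earns any medal; the task names and outcomes are made-up placeholders and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List


def medal_rate(outcomes: Dict[str, bool]) -> float:
    """Fraction of tasks on which a single run earned any medal."""
    return sum(outcomes.values()) / len(outcomes)


def summarize(runs: List[Dict[str, bool]]) -> Dict[str, float]:
    """Mean medal rate and standard error of the mean across runs."""
    rates = [medal_rate(r) for r in runs]
    # The sample standard deviation needs at least two runs.
    sem = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
    return {"mean": mean(rates), "sem": sem}


# Placeholder outcomes for three seeds on four hypothetical tasks.
runs = [
    {"t1": True, "t2": True, "t3": False, "t4": False},
    {"t1": True, "t2": True, "t3": True, "t4": False},
    {"t1": True, "t2": False, "t3": False, "t4": False},
]
summary = summarize(runs)
print(summary)
```

Reporting the spread alongside the headline number would let readers judge whether a gap between two systems exceeds what LLM stochasticity alone could produce.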

minor comments (2)

[Abstract] The abstract states that results 'match the capability of highly experienced AI engineers' without any quantitative human baseline or side-by-side comparison; this phrasing should be qualified or removed unless supported by data in the full evaluation.
[Method section] A system diagram or pseudocode illustrating the exact workflow and handoff between manager and sub-agents would improve clarity of the hierarchical architecture.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify important gaps in experimental transparency and methodological specification. We agree that addressing these points will strengthen the manuscript's reproducibility and allow for a more rigorous evaluation of the hierarchical agent architecture. We respond to each major comment below and commit to the indicated revisions.

Referee: [Abstract and Experiments section] The headline result of a 63.1% medal rate and first-place ranking on MLE-Bench is presented without any description of the base LLMs powering the manager/designer/coder/tuner agents, the number of tasks attempted versus completed, the number of independent trials per task, retry budgets, failure-handling protocols, or the precise definition of a 'medal' used by the benchmark. These omissions make it impossible to determine whether the reported superiority is attributable to the hierarchical architecture or to unreported implementation choices, directly undermining the central empirical claim.

Authors: We agree that these details are necessary to substantiate the central empirical claims and to distinguish the contribution of the architecture from implementation specifics. In the revised manuscript we will add a dedicated 'Experimental Setup' subsection that explicitly states the base LLMs used for the manager and each sub-agent, the total number of MLE-Bench tasks attempted and completed, the number of independent trials per task, the retry budgets and failure-handling protocols, and the precise definition of a 'medal' as specified by the benchmark. These additions will be placed in the Experiments section and referenced from the abstract where appropriate.
Revision: yes

Referee: [Method section] The paper asserts that the sub-agents enable 'end-to-end automation ... without human intervention,' yet provides no concrete specification of inter-agent communication protocols, tool-use interfaces, state sharing, or error-recovery mechanisms. Without these details, the weakest assumption (that LLM agents can reliably execute the full multi-step pipeline across diverse modalities) cannot be evaluated, leaving the architectural contribution untestable.

Authors: We acknowledge that the current Method section lacks the level of implementation detail required to make the system testable and reproducible. In the revision we will expand the hierarchical architecture description with a new subsection that specifies the inter-agent communication protocols (including message formats and delegation procedures), tool-use interfaces (code execution environment, data loaders, and evaluation tools), state sharing mechanisms (shared workspace and conversation history), and error-recovery mechanisms (retry logic, fallback strategies, and escalation to the manager). These concrete specifications will allow readers to assess the reliability of the end-to-end automation claim.
Revision: yes

Referee: [Experiments section] The claim that AIBuildAI 'outperforms all existing baseline methods' is unsupported by any description of baseline re-implementations, statistical significance tests, variance across runs, or ablation studies isolating the contribution of the manager or individual sub-agents. This absence prevents assessment of whether the medal-rate advantage is robust or confounded by differences in underlying model capabilities.

Authors: We will revise the Experiments section to include detailed descriptions of all baseline methods, specifying whether they were re-implemented from original code or taken from published results and noting any adaptations required for fair comparison. We will also report statistical significance tests, discuss observed variance across runs (accounting for LLM stochasticity), and present ablation studies that isolate the manager agent and each sub-agent. Where full multi-run variance or exhaustive ablations were not performed in the original experiments, we will explicitly note this as a limitation and provide the available partial results.
Revision: partial
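
The error-recovery mechanisms named in the second response (retry logic and escalation to the manager) could take a shape like the following. This is a hedged sketch only: the `Message` format, the `Escalation` exception, the `run_with_retries` helper, and the default budget of three attempts are all assumptions made for illustration, since neither the abstract nor the rebuttal specifies them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    """Hypothetical message format for sub-agent replies to the manager."""
    sender: str
    recipient: str
    content: str


class Escalation(Exception):
    """Raised when a sub-agent exhausts its retry budget."""


def run_with_retries(step: Callable[[int], str], agent: str, budget: int = 3) -> Message:
    """Call a sub-agent step up to `budget` times, then escalate to the manager."""
    last_error: Optional[Exception] = None
    for attempt in range(1, budget + 1):
        try:
            return Message(sender=agent, recipient="manager", content=step(attempt))
        except RuntimeError as err:
            # Remember the failure and try again within the budget.
            last_error = err
    raise Escalation(f"{agent} failed after {budget} attempts: {last_error}")


def flaky_coder(attempt: int) -> str:
    # Toy stand-in for a coder agent that fails on its first two attempts.
    if attempt < 3:
        raise RuntimeError(f"traceback on attempt {attempt}")
    return "train.py passes smoke test"


reply = run_with_retries(flaky_coder, agent="coder")
print(reply)
```

With a budget of two, the same call raises `Escalation`, which is the point at which a manager would need a fallback strategy such as requesting a new design.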

Circularity Check

0 steps flagged

No circularity; empirical benchmark result stands independent of internal definitions

The paper reports an empirical outcome (63.1% medal rate on external MLE-Bench) obtained by running the described hierarchical LLM agent on a fixed public benchmark. No equations, fitted parameters, or first-principles derivations are present; the central claim is a measured performance number on tasks whose success criteria and data are defined outside the paper. No self-citations are invoked to justify uniqueness or to close any logical loop, and the architecture description does not redefine or presuppose the reported metric. The result is therefore self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the unproven assumption that current LLMs can serve as reliable autonomous agents for software engineering tasks and that MLE-Bench tasks are representative of real deployment scenarios. The AIBuildAI system itself is the primary new entity introduced.

axioms (2)

Domain assumption: Large language models can perform reliable multi-step reasoning, tool use, code generation, and iterative debugging for AI model development tasks.
Invoked when describing the capabilities of the designer, coder, and tuner sub-agents.
Domain assumption: The MLE-Bench benchmark tasks and evaluation protocol accurately reflect the capabilities of experienced human AI engineers.
Used to interpret the 63.1% medal rate as matching human expert performance.

invented entities (1)

AIBuildAI hierarchical agent system (no independent evidence)
Purpose: To coordinate design, coding, and tuning sub-agents for end-to-end AI model construction.
The system is the primary contribution; no independent external evidence for its reliability is provided in the abstract.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 6 internal anchors

[1]

M.Computing machinery and intelligence, 23–65 (Springer, 2007)

Turing, A. M.Computing machinery and intelligence, 23–65 (Springer, 2007)

work page 2007
[2]

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects.Science 349, 255–260 (2015)

work page 2015
[3]

Probabilistic machine learning and artificial intelligence.Nature521, 452–459 (2015)

Ghahramani, Z. Probabilistic machine learning and artificial intelligence.Nature521, 452–459 (2015)

work page 2015
[4]

Biamonte, J.et al.Quantum machine learning.Nature549, 195–202 (2017)

work page 2017
[5]

& Sun, J

He, K., Zhang, X., Ren, S. & Sun, J. Agapito, L., Berg, T., Kosecka, J. & Zelnik-Manor, L. (eds)Deep residual learning for image recognition. (eds Agapito, L., Berg, T., Kosecka, J. & Zelnik-Manor, L.) Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016)

work page 2016
[6]

W., Medina, J

Otter, D. W., Medina, J. R. & Kalita, J. K. A survey of the usages of deep learning for natural language processing.IEEE transactions on neural networks and learning systems32, 604–624 (2020). 17

work page 2020
[7]

Price, I.et al.Probabilistic weather forecasting with machine learning.Nature637, 84–90 (2025)

work page 2025
[8]

Hollmann, N.et al.Accurate predictions on small data with a tabular foundation model.Nature 637, 319–326 (2025)

work page 2025
[9]

& Bengio, Y

Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization.Journal of machine learning research13(2012)

work page 2012
[10]

D., Lee, D

Sculley, D.et al.Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R. (eds)Hidden technical debt in machine learning systems. (eds Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.)Proceedings of the International Conference on Neural Information Processing Systems, Vol. 2, 2503–2511 (2015)

work page 2015
[11]

& Vanschoren, J.Automated machine learning: methods, systems, challenges (Springer, 2019)

Hutter, F., Kotthoff, L. & Vanschoren, J.Automated machine learning: methods, systems, challenges (Springer, 2019)

work page 2019
[12]

Aldoseri, A., Al-Khalifa, K. N. & Hamouda, A. M. Re-thinking data strategy and integration for artificial intelligence: concepts, opportunities, and challenges.Applied Sciences13, 7082 (2023)

work page 2023
[13]

W., Katabi, D

Yang, Y., Zhang, H., Gichoya, J. W., Katabi, D. & Ghassemi, M. The limits of fair medical imaging ai in real-world generalization.Nature medicine30, 2838–2848 (2024)

work page 2024
[14]

A few useful things to know about machine learning.Communications of the ACM 55, 78–87 (2012)

Domingos, P. A few useful things to know about machine learning.Communications of the ACM 55, 78–87 (2012)

work page 2012
[15]

D., Lee, D

Feurer, M.et al.Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R. (eds)Efficient and robust automated machine learning. (eds Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.)Proceedings of the International Conference on Neural Information Processing Systems, 2755–2763 (2015)

work page 2015
[16]

& Weinberger, K

Henderson, P.et al.McIlraith, S. & Weinberger, K. (eds)Deep reinforcement learning that matters. (eds McIlraith, S. & Weinberger, K.)Proceedings of the AAAI conference on artificial intelligence, Vol. 32 (2018)

work page 2018
[17]

Thornton, C., Hutter, F., Hoos, H. H. & Leyton-Brown, K. Dhillon, I. S., Koren, Y., Ghani, R., Senator, T. E. & Schmerl, B. (eds)Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. (eds Dhillon, I. S., Koren, Y., Ghani, R., Senator, T. E. & Schmerl, B.)Proceedings of the 19th ACM SIGKDD International Conference on K...

work page 2013
[18]

& Chu, X

He, X., Zhao, K. & Chu, X. Automl: A survey of the state-of-the-art.Knowledge-based systems 212, 106622 (2021)

work page 2021
[19]

URL https://openreview.net/forum?id=RwfrdKSgCE

Toledo, E.et al.AI research agents for machine learning: Search, exploration, and generalization in MLE-bench (2025). URL https://openreview.net/forum?id=RwfrdKSgCE

work page 2025
[20]

Automlgen: Navigating fine- grained optimization for coding agents.arXiv preprint arXiv:2510.08511, 2025a

Du, S.et al.Automlgen: Navigating fine-grained optimization for coding agents.ArXiv abs/2510.08511(2025). URL https://api.semanticscholar.org/CorpusID:281951479

work page arXiv 2025
[21]

Hong, S.et al.MetaGPT: Meta programming for a multi-agent collaborative framework.The Twelfth International Conference on Learning Representations (ICLR)(2024)

work page 2024
[22]

Conference on Language Modeling (COLM)(2024)

Wu, Q.et al.AutoGen: Enabling next-gen LLM applications via multi-agent conversation. Conference on Language Modeling (COLM)(2024)

work page 2024
[23]

S.et al.Mle-bench: Evaluating machine learning agents on machine learning engineering (2025)

Chan, J. S.et al.Mle-bench: Evaluating machine learning agents on machine learning engineering (2025). International Conference on Learning Representations (ICLR)

work page 2025
[24]

Mle-bench leaderboard (commit c5631ba)

OpenAI. Mle-bench leaderboard (commit c5631ba). https://github.com/openai/mle-bench/tree/ c5631ba61ceeb0573235a6ce209db435327a1e84 (2026). Accessed: 2026-03-18. 18

work page 2026
[25]

Chen, J.et al.MARS: Modular agent with reflective search for automated AI research.arXiv preprint arXiv:2602.02660(2026)

work page internal anchor Pith review arXiv 2026
[26]

Li, A.et al.The FM agent.arXiv preprint arXiv:2510.26144(2025)

work page arXiv 2025
[27]

Liu, Z.et al.ML-Master: Towards AI-for-AI via integration of exploration and reasoning.arXiv preprint arXiv:2506.16499(2025)

work page arXiv 2025
[28]

Kapso: A knowledge- grounded framework for autonomous program synthesis and optimization.arXiv preprint arXiv:2601.21526, 2026

Nadafian, A., Mohammadshahi, A. & Yazdani, M. KAPSO: A knowledge-grounded framework for autonomous program synthesis and optimization.arXiv preprint arXiv:2601.21526(2026)

work page arXiv 2026
[29]

Team, I.et al.InternAgent: When agent becomes the scientist—building closed-loop system from hypothesis to verification.arXiv preprint arXiv:2505.16938(2025)

work page arXiv 2025
[30]

Yang, X.et al.R&D-Agent: An LLM-Agent framework towards autonomous data science.arXiv preprint arXiv:2505.14738(2025)

work page arXiv 2025
[31]

Jiang, Z.et al.AIDE: AI-Driven exploration in the space of code.arXiv preprint arXiv:2502.13138 (2025)

work page internal anchor Pith review arXiv 2025
[32]

ImageNet classiﬁc ation with deep convolutional neural networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks.Commun. ACM60, 84–90 (2017). URL https://doi.org/10.1145/3065386

work page doi:10.1145/3065386 2017
[33]

Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks (2019)

work page 2019
[34]

URL https://openreview.net/forum?id=YicbFdNTTy

Dosovitskiy, A.et al.An image is worth 16x16 words: Transformers for image recognition at scale (2021). URL https://openreview.net/forum?id=YicbFdNTTy

work page 2021
[35]

Liu, Z.et al.Swin transformer: Hierarchical vision transformer using shifted windows.Proceedings of the IEEE/CVF International Conference on Computer Vision10012–10022 (2021)

work page 2021
[36]

Neurocomputing , author =

Wang, M. & Deng, W. Deep visual domain adaptation: A survey.Neurocomput.312, 135–153 (2018). URL https://doi.org/10.1016/j.neucom.2018.05.083

work page doi:10.1016/j.neucom.2018.05.083 2018
[37]

D., Zoph, B., Man´ e, D., Vasudevan, V

Cubuk, E. D., Zoph, B., Man´ e, D., Vasudevan, V. & Le, Q. V. Autoaugment: Learning augmen- tation strategies from data. (2019). URL http://dblp.uni-trier.de/db/conf/cvpr/cvpr2019.html# CubukZMVL19

work page 2019
[38]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

Liu, Z.et al.A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

work page 2022
[39]

& Komodakis, N

Zagoruyko, S. & Komodakis, N. Wide residual networks.Proceedings of the British Machine Vision Conference (BMVC)(2016)

work page 2016
[40]

Deng, J.et al.ImageNet: A large-scale hierarchical image database.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition248–255 (2009)

work page 2009
[41]

Tan, M. & Le, Q. V. EfficientNetV2: Smaller models and faster training.Proceedings of the International Conference on Machine Learning (ICML)10096–10106 (2021)

work page 2021
[42]

B., He, K

Lin, T.-Y., Goyal, P., Girshick, R. B., He, K. & Doll´ ar, P. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence42, 318–327 (2017). URL https: //api.semanticscholar.org/CorpusID:206771220

work page 2017
[43]

Liu, L.et al.Deep learning for generic object detection: A survey.Int. J. Comput. Vision128, 261–318 (2020). URL https://doi.org/10.1007/s11263-019-01247-4

work page doi:10.1007/s11263-019-01247-4 2020
[44]

Sharma, R., Saqib, M., Lin, C. T. & Blumenstein, M. A survey on object instance segmentation. SN Comput. Sci.3(2022). URL https://doi.org/10.1007/s42979-022-01407-3

work page doi:10.1007/s42979-022-01407-3 2022
[45]

& Sun, J

Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks.Advances in Neural Information Processing Systems28(2015). 19

work page 2015
[46]

& Farhadi, A

Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition779– 788 (2016)

work page 2016
[47]

& Brox, T

Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmen- tation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)

work page 2015
[48]

Rethinking Atrous Convolution for Semantic Image Segmentation

Chen, L.-C., Papandreou, G., Schroff, F. & Adam, H. Rethinking atrous convolution for semantic image segmentation (2017). arXiv:1706.05587

work page internal anchor Pith review arXiv 2017
[49]

& Zisserman, A

Simonyan, K. & Zisserman, A. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems (NeurIPS) (2014)

work page 2014
[50]

Arnab, A.et al.ViViT: A video vision transformer.Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)6836–6846 (2021)

work page 2021
[51]

Berman, M., Triki, A. R. & Blaschko, M. B. The Lov´ asz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks 4413–4421 (2018)

work page 2018
[52]

& Ahmadi, S.-A

Milletari, F., Navab, N. & Ahmadi, S.-A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation.Proceedings of the International Conference on 3D Vision (3DV) 565–571 (2016)

work page 2016
[53]

Advances in Neural Information Processing Systems34, 12077–12090 (2021)

Xie, E.et al.SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems34, 12077–12090 (2021)

work page 2021
[54]

& Janvin, C

Bengio, Y., Ducharme, R., Vincent, P. & Janvin, C. A neural probabilistic language model.J. Mach. Learn. Res.3, 1137–1155 (2003)

work page 2003
[55]

& Toutanova, K

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional trans- formers for language understanding. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2019)

work page 2019
[56]

arXiv (2019)

Liu, Y.et al.Roberta: A robustly optimized bert pretraining approach. arXiv (2019)

work page 2019
[57]

Raffel, C.et al.Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res.21(2020)

work page 2020
[58]

OpenAI Technical Report (2019)

Radford, A.et al.Language models are unsupervised multitask learners. OpenAI Technical Report (2019)

work page 2019
[59]

& Wolf, T

Sanh, V., Debut, L., Chaumond, J. & Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv (2019)

work page 2019
[60]

van den Oord, A.et al.WaveNet: A generative model for raw audio.arXiv preprint arXiv:1609.03499 (2016)

work page internal anchor Pith review arXiv 2016
[61]

URL https: //api.semanticscholar.org/CorpusID:8810481

Hershey, S.et al.Cnn architectures for large-scale audio classification.2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)131–135 (2016). URL https: //api.semanticscholar.org/CorpusID:8810481

work page 2017
[62]

Long short -term memory,

Hochreiter, S. & Schmidhuber, J. Long short-term memory.Neural Comput.9, 1735–1780 (1997). URL https://doi.org/10.1162/neco.1997.9.8.1735

work page doi:10.1162/neco.1997.9.8.1735 1997
[63]

Bai, S., Kolter, J. Z. & Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.ArXivabs/1803.01271(2018). URL https://api.semanticscholar. org/CorpusID:4747877

work page internal anchor Pith review arXiv 2018
[64]

& Varoquaux, G

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? (2022). URL https://openreview.net/forum?id=Fp7 phQszn. 20

work page 2022
[65]

Chen and C

Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system (2016). URL https://doi.org/ 10.1145/2939672.2939785

work page doi:10.1145/2939672.2939785 2016
[66]

Ke, G.et al.Lightgbm: a highly efficient gradient boosting decision tree (2017)

work page 2017
[67]

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features.Advances in Neural Information Processing Systems31(2018)

work page 2018
[68]

Arik, S. ¨O. & Pfister, T. TabNet: Attentive interpretable tabular learning.Proceedings of the AAAI Conference on Artificial Intelligence35, 6679–6687 (2021)

work page 2021
[69]

Jasper, H. H. The ten-twenty electrode system of the International Federation.Electroencephalog- raphy and Clinical Neurophysiology10, 371–375 (1958)

work page 1958
[70]

Wu, K.et al.TinyViT: Fast pretraining distillation for small vision transformers.Proceedings of the European Conference on Computer Vision (ECCV)(2022)

work page 2022
[71]

N., Hani, A

Acharya, J. N., Hani, A. J., Thirumala, P. D. & Tsuchida, T. N. American clinical neurophysiology society guideline 3: A proposal for standard montages to be used in clinical EEG.Journal of Clinical Neurophysiology33, 312–316 (2016)

work page 2016
[72]

On the theory of filter amplifiers.Experimental Wireless and the Wireless Engineer 7, 536–541 (1930)

Butterworth, S. On the theory of filter amplifiers.Experimental Wireless and the Wireless Engineer 7, 536–541 (1930)

work page 1930
[73]

Ding, D.et al.Hybrid LLM: Cost-efficient and quality-aware query routing.Proceedings of the Twelfth International Conference on Learning Representations(2024)

work page 2024
[74]

Wang, X.et al.MixLLM: Dynamic routing in mixed large language models.Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (2025)

work page 2025
[75]

Hoffmann, J.et al.Training compute-optimal large language models.Advances in Neural Information Processing Systems (NeurIPS)(2022)

work page 2022
[76]

Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)(2022)

Zheng, L.et al.Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)(2022)

work page 2022
[77]

Zhu, Z.et al.Mist: Efficient distributed training of large language models via memory-parallelism co-optimization.Proceedings of the 20th European Conference on Computer Systems (EuroSys) (2025)

work page 2025
[78]

& Shi, W

Wang, Z., Li, Z., Jiang, Z., Tu, D. & Shi, W. Crafting personalized agents through retrieval- augmented generation on editable memory graphs.Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)(2024)

[79]

Xu, W. et al. A-MEM: Agentic memory for LLM agents. Advances in Neural Information Processing Systems (NeurIPS) (2025)

[80]

Peidli, S. et al. scPerturb: Harmonized single-cell perturbation data. Nature Methods 21, 531–540 (2024)


