arxiv: 2502.13138 · v1 · pith:H5E4RBVVnew · submitted 2025-02-18 · 💻 cs.AI · cs.LG

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu

Pith reviewed 2026-05-17 18:16 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords learning · machine · aide · engineering · solutions · ai-driven · code · exploration

The pith

AIDE uses large language models to perform tree search in code space and reaches state-of-the-art results on Kaggle, OpenAI MLE-Bench, and METR RE-Bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Developing machine learning models requires repeated rounds of writing, testing, and tweaking code. AIDE turns this process into a search problem: it starts with an initial code solution and generates variations, keeping the promising ones and discarding the rest, much like exploring branches on a tree. The large language model acts as the guide that proposes new code changes and evaluates how well they work. By reusing good partial solutions and refining them, the system trades extra compute time for better final performance. The authors test this on standard machine learning engineering benchmarks and report that it outperforms prior approaches.
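The search loop described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not AIDE's implementation: the "solution" is a single number standing in for a full training script, and `toy_draft`, `toy_improve`, and `toy_evaluate` are hypothetical stand-ins for the LLM's drafting step, the LLM's edit step, and a validation run. The selection rule (always refine the current best node) is also an assumption chosen for brevity.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate solution in the search tree."""
    solution: float                      # stand-in for a full training script
    score: float                         # validation metric, higher is better
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def tree_search(
    draft: Callable[[random.Random], float],
    improve: Callable[[float, random.Random], float],
    evaluate: Callable[[float], float],
    n_drafts: int = 3,
    n_steps: int = 20,
    seed: int = 0,
) -> Node:
    """Draft a few root solutions, then repeatedly refine the best node so far."""
    rng = random.Random(seed)
    nodes: List[Node] = []
    for _ in range(n_drafts):
        s = draft(rng)
        nodes.append(Node(s, evaluate(s)))
    for _ in range(n_steps):
        parent = max(nodes, key=lambda n: n.score)  # reuse the most promising solution
        s = improve(parent.solution, rng)           # hypothetical stand-in for an LLM edit
        child = Node(s, evaluate(s), parent=parent)
        parent.children.append(child)
        nodes.append(child)                         # weak children stay in the tree, unexpanded
    return max(nodes, key=lambda n: n.score)


# Toy stand-ins (assumptions, not part of AIDE): the metric peaks at x = 3.
def toy_draft(rng: random.Random) -> float:
    return rng.uniform(-10, 10)


def toy_improve(x: float, rng: random.Random) -> float:
    return x + rng.gauss(0, 1)


def toy_evaluate(x: float) -> float:
    return -(x - 3.0) ** 2


best = tree_search(toy_draft, toy_improve, toy_evaluate)
print(best.solution, best.score)
```

Raising `n_steps` is the knob that spends more compute for a potentially better final score, which is the trade the summary describes.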

Core claim

By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.

Load-bearing premise

That the tree search guided by LLMs can reliably identify and improve upon promising code variants without the search space becoming intractable or the evaluations becoming unreliable.

Original abstract

Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind advancements lies a complex and often tedious process requiring labor- and compute-intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AIDE, an LLM-based agent that frames machine learning engineering as a code optimization problem solved via tree search over candidate solutions. It claims that strategic reuse and refinement of promising code variants allows trading additional compute for improved performance, yielding state-of-the-art results on Kaggle evaluations, OpenAI MLE-Bench, and METR's RE-Bench.

Significance. If the central performance claims are shown to be robust to controls for total compute, the work would be significant for automated ML and LLM agents: it supplies a concrete mechanism (LLM-guided tree search with reuse) for converting extra evaluations into better outcomes rather than relying on naive sampling. The multi-benchmark evaluation protocol is a positive feature that supports external validity.

major comments (2)

[Experiments] Experiments section: no ablation is reported that holds total LLM generations and evaluations fixed while removing the tree-search reuse structure (i.e., a flat baseline of independent samples). This directly tests the load-bearing claim that the tree-search framing, rather than simply more compute, is responsible for the reported gains.
[Method] Method section: the tree-search procedure is parameterized by several free hyperparameters whose values are not subjected to sensitivity analysis or ablation; without this, it remains unclear whether the reported SOTA results generalize or depend on benchmark-specific tuning of the search policy.
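The compute-matched control requested in the first major comment could be set up along the following lines. This is a hedged sketch on a toy metric, not the paper's experimental code: `evaluate`, `flat_sampling`, `search_with_reuse`, and `compare` are hypothetical names, a candidate is a single number rather than a script, and "compute" is counted as the number of candidate evaluations.

```python
import random
from statistics import mean


def evaluate(x: float) -> float:
    """Toy validation metric (assumption): higher is better, peak at x = 3."""
    return -(x - 3.0) ** 2


def flat_sampling(budget: int, rng: random.Random) -> float:
    """Baseline arm: `budget` independent drafts, no reuse of earlier candidates."""
    return max(evaluate(rng.uniform(-10, 10)) for _ in range(budget))


def search_with_reuse(budget: int, rng: random.Random, n_drafts: int = 3) -> float:
    """Reuse arm: a few drafts, then every new candidate refines the current best."""
    scored = [(evaluate(x), x) for x in (rng.uniform(-10, 10) for _ in range(n_drafts))]
    for _ in range(budget - n_drafts):
        _, parent = max(scored)                  # best candidate so far
        child = parent + rng.gauss(0, 1)         # stand-in for an LLM refinement
        scored.append((evaluate(child), child))  # exactly one evaluation per candidate
    return max(scored)[0]


def compare(budget: int = 20, n_seeds: int = 50) -> dict:
    """Mean best score per arm across seeds, with the evaluation budget held equal."""
    flat = [flat_sampling(budget, random.Random(s)) for s in range(n_seeds)]
    reuse = [search_with_reuse(budget, random.Random(s)) for s in range(n_seeds)]
    return {"flat": mean(flat), "reuse": mean(reuse)}


results = compare()
print(results)
```

Both arms evaluate exactly `budget` candidates per seed, so any gap between them is attributable to the reuse structure rather than to extra samples. Nothing about real benchmarks should be read into the toy numbers.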

minor comments (1)

[Abstract] Abstract: quantitative margins, exact evaluation protocols, and statistical significance are omitted, making it difficult for readers to gauge the practical magnitude of the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree that revisions are warranted and outlining the changes we will make to strengthen the manuscript.

Point-by-point responses

Referee: [Experiments] Experiments section: no ablation is reported that holds total LLM generations and evaluations fixed while removing the tree-search reuse structure (i.e., a flat baseline of independent samples). This directly tests the load-bearing claim that the tree-search framing, rather than simply more compute, is responsible for the reported gains.

Authors: We agree that this controlled ablation would directly test whether the tree-search structure with reuse and refinement provides benefits beyond simply allocating additional independent LLM generations and evaluations. Our existing evaluations compare AIDE against other agent baselines on the benchmarks, but we did not include a flat-sampling control that exactly matches total compute. In the revised manuscript we will add this ablation on at least one benchmark (e.g., a Kaggle task or a subset of MLE-Bench), holding the total number of LLM calls and code evaluations fixed while comparing the full tree-search procedure against independent sampling without reuse. revision: yes

Referee: [Method] Method section: the tree-search procedure is parameterized by several free hyperparameters whose values are not subjected to sensitivity analysis or ablation; without this, it remains unclear whether the reported SOTA results generalize or depend on benchmark-specific tuning of the search policy.

Authors: The tree-search procedure uses several hyperparameters (branching factor, selection threshold for promising nodes, and maximum depth). These were selected during initial development on a small development set and then held fixed across all three benchmark suites to demonstrate that the same policy works without per-benchmark retuning. We acknowledge that a formal sensitivity analysis would further support robustness claims. In the revision we will add a sensitivity study for the primary hyperparameters, reporting performance variation on a representative task from one of the benchmarks. revision: yes
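A sensitivity study of the kind promised here might be organized as a small grid sweep. The sketch below is illustrative only: `toy_search` is a hypothetical stand-in for one full agent run on a toy metric, the grid values are arbitrary, and only branching factor and maximum depth are varied (the selection threshold mentioned in the response is left out for brevity).

```python
import itertools
import random
from statistics import mean


def toy_search(branching: int, max_depth: int, seed: int) -> float:
    """Hypothetical stand-in for one agent run; returns the best toy score found."""
    rng = random.Random(seed)
    best_x = rng.uniform(-10, 10)
    best = -(best_x - 3.0) ** 2                   # toy metric, peak at x = 3
    for _ in range(max_depth):
        # Expand the current best node into `branching` children.
        children = [best_x + rng.gauss(0, 1) for _ in range(branching)]
        for c in children:
            score = -(c - 3.0) ** 2
            if score > best:
                best, best_x = score, c
    return best


def sensitivity(grid: dict, seeds=range(10)) -> dict:
    """Mean score for every grid point, averaged over seeds."""
    names = list(grid)
    table = {}
    for values in itertools.product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        table[values] = mean(toy_search(seed=s, **cfg) for s in seeds)
    return table


grid = {"branching": [1, 2, 4], "max_depth": [2, 5, 10]}
table = sensitivity(grid)
for cfg, score in sorted(table.items()):
    print(cfg, round(score, 3))
```

Reporting the spread of scores along each axis of `table`, rather than only the best cell, is what would show whether the results hinge on one particular setting.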

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

The paper introduces AIDE as an LLM-guided tree search system for ML code optimization and reports SOTA performance on independent external benchmarks (Kaggle evaluations, OpenAI MLE-Bench, METR RE-Bench). No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on measured outcomes from separate test suites rather than any quantity being defined in terms of itself or forced by internal construction. Self-citations, if present, are not load-bearing for the central empirical result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can serve as effective proposal and evaluation oracles for code modifications and that tree search is an appropriate structure for the ML engineering search space. No explicit free parameters or invented entities are named in the abstract.

free parameters (1)

tree search hyperparameters
Depth, branching factor, and selection criteria for the search tree are not specified in the abstract but must be chosen or tuned to achieve the reported results.

axioms (1)

domain assumption: Large language models can generate and evaluate useful code modifications for machine learning tasks.
The method depends on this capability of current LLMs; if it fails, the tree search cannot progress.

pith-pipeline@v0.9.0 · 5465 in / 1247 out tokens · 66859 ms · 2026-05-17T18:16:23.603996+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation / DAlembert.Inevitability RCL_is_unique_functional_form_of_logic unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

achieving state-of-the-art results on multiple machine learning engineering benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
cs.AI 2026-05 unverdicted novelty 7.0

AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
cs.AI 2026-04 unverdicted novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
DataMaster: Data-Centric Autonomous AI Research
cs.LG 2026-05 unverdicted novelty 6.0

DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...
DataMaster: Data-Centric Autonomous AI Research
cs.LG 2026-05 unverdicted novelty 6.0

DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
cs.LG 2026-05 unverdicted novelty 6.0

CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents
cs.LG 2026-05 unverdicted novelty 6.0

SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage poin...
TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling
cs.AI 2026-04 unverdicted novelty 6.0

TrafficClaw creates a single runtime environment for heterogeneous urban traffic subsystems and deploys an LLM agent with spatiotemporal reasoning to deliver robust control that generalizes across unseen scenarios.
AgentGA: Evolving Code Solutions in Agent-Seed Space
cs.AI 2026-04 unverdicted novelty 6.0

AgentGA uses a genetic algorithm to evolve agent seeds and achieves 74.52% human-exceeding performance on tabular AutoML tasks versus 54.15% for the AIDE baseline.
AgentGA: Evolving Code Solutions in Agent-Seed Space
cs.AI 2026-04 unverdicted novelty 6.0

AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.
AIBuildAI: An AI Agent for Automatically Building AI Models
cs.AI 2026-04 unverdicted novelty 6.0

AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
Toward Autonomous Long-Horizon Engineering for ML Research
cs.CL 2026-04 unverdicted novelty 6.0

AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
AIRA_2: Overcoming Bottlenecks in AI Research Agents
cs.AI 2026-03 conditional novelty 6.0

AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...
A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
cs.AI 2026-03 unverdicted novelty 6.0

SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
cs.LG 2026-03 unverdicted novelty 6.0

Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
cs.CL 2025-09 unverdicted novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
GEAR: Genetic AutoResearch for Agentic Code Evolution
cs.NE 2026-05 unverdicted novelty 5.0

GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
cs.LG 2026-02 unverdicted novelty 5.0

AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.
