AIDE: AI-Driven Exploration in the Space of Code
Pith reviewed 2026-05-17 18:16 UTC · model grok-4.3
The pith
AIDE uses large language models to perform tree search in code space and reaches state-of-the-art results on Kaggle, OpenAI MLE-Bench, and METR RE-Bench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.
Load-bearing premise
That the tree search guided by LLMs can reliably identify and improve upon promising code variants without the search space becoming intractable or the evaluations becoming unreliable.
Original abstract
Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind advancements lies a complex and often tedious process requiring labor- and compute-intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.
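The abstract's framing of trial-and-error as a tree search over candidate solutions can be illustrated with a minimal sketch. This is not the authors' implementation: `draft`, `improve`, and `evaluate` are hypothetical stand-ins for an LLM writing a fresh script, an LLM refining an existing script, and running the script to obtain a validation score. The greedy "always refine the best node" policy and the `num_drafts` parameter are assumptions made for brevity, and a single float stands in for a code solution.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate solution in the search tree (hypothetical structure)."""
    solution: float
    score: float
    parent: Optional["Node"] = None


def tree_search(
    draft: Callable[[], float],
    improve: Callable[[float], float],
    evaluate: Callable[[float], float],
    budget: int,
    num_drafts: int = 3,
) -> Node:
    """Spend `budget` evaluations: draft a few roots, then refine the best node."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    nodes: List[Node] = []
    for step in range(budget):
        if step < num_drafts or not nodes:
            # Start a new root: an independent attempt with no parent.
            parent = None
            candidate = draft()
        else:
            # Reuse: pick the highest-scoring node so far and refine it.
            parent = max(nodes, key=lambda n: n.score)
            candidate = improve(parent.solution)
        nodes.append(Node(candidate, evaluate(candidate), parent))
    return max(nodes, key=lambda n: n.score)


# Toy stand-ins (assumptions): the "solution" is a number and the score
# is higher the closer it gets to a hidden target.
rng = random.Random(0)
TARGET = 3.0
best = tree_search(
    draft=lambda: rng.uniform(-10, 10),
    improve=lambda x: x + rng.gauss(0, 0.5),
    evaluate=lambda x: -(x - TARGET) ** 2,
    budget=50,
)
print(f"best solution {best.solution:.3f}, score {best.score:.4f}")
```

Raising `budget` gives the loop more refinement steps on the current best node, which is the sense in which such a search trades compute for performance. Whether that trade is reliable on real code is exactly the load-bearing premise stated above.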
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AIDE, an LLM-based agent that frames machine learning engineering as a code optimization problem solved via tree search over candidate solutions. It claims that strategic reuse and refinement of promising code variants allows trading additional compute for improved performance, yielding state-of-the-art results on Kaggle evaluations, OpenAI MLE-Bench, and METR's RE-Bench.
Significance. If the central performance claims are shown to be robust to controls for total compute, the work would be significant for automated ML and LLM agents: it supplies a concrete mechanism (LLM-guided tree search with reuse) for converting extra evaluations into better outcomes rather than relying on naive sampling. The multi-benchmark evaluation protocol is a positive feature that supports external validity.
major comments (2)
- [Experiments] Experiments section: no ablation is reported that holds total LLM generations and evaluations fixed while removing the tree-search reuse structure (i.e., a flat baseline of independent samples). This directly tests the load-bearing claim that the tree-search framing, rather than simply more compute, is responsible for the reported gains.
- [Method] Method section: the tree-search procedure is parameterized by several free hyperparameters whose values are not subjected to sensitivity analysis or ablation; without this, it remains unclear whether the reported SOTA results generalize or depend on benchmark-specific tuning of the search policy.
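The compute-matched control requested in the first major comment can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the paper's protocol: `flat_sampling`, `refine_search`, `make_counted`, and `compare` are hypothetical names, a float stands in for a code solution, and each step counts as one generation plus one evaluation so that both arms consume the same budget.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple


def make_counted(evaluate: Callable[[float], float]):
    """Wrap an evaluator so the number of calls can be audited afterwards."""
    calls: Dict[str, int] = {"n": 0}

    def counted(x: float) -> float:
        calls["n"] += 1
        return evaluate(x)

    return counted, calls


def flat_sampling(draft, evaluate, budget: int) -> float:
    """Baseline arm: `budget` independent drafts, no reuse of earlier candidates."""
    return max(evaluate(draft()) for _ in range(budget))


def refine_search(draft, improve, evaluate, budget: int, num_drafts: int = 3) -> float:
    """Reuse arm: a few drafts, then always perturb the best candidate so far."""
    best_x, best_s = None, float("-inf")
    for step in range(budget):
        if step < num_drafts or best_x is None:
            x = draft()
        else:
            x = improve(best_x)
        s = evaluate(x)
        if s > best_s:
            best_x, best_s = x, s
    return best_s


def compare(budget: int, seeds: Iterable[int]) -> List[Tuple[int, float, float]]:
    """Run both arms per seed and check that they used identical budgets."""
    objective = lambda x: -(x - 3.0) ** 2  # toy score, higher is better
    rows = []
    for seed in seeds:
        flat_rng = random.Random(seed)
        flat_eval, flat_calls = make_counted(objective)
        flat = flat_sampling(lambda: flat_rng.uniform(-10, 10), flat_eval, budget)

        tree_rng = random.Random(seed)
        tree_eval, tree_calls = make_counted(objective)
        tree = refine_search(
            lambda: tree_rng.uniform(-10, 10),
            lambda x: x + tree_rng.gauss(0, 0.5),
            tree_eval,
            budget,
        )
        # The control is only meaningful if both arms spent the same compute.
        assert flat_calls["n"] == tree_calls["n"] == budget
        rows.append((seed, flat, tree))
    return rows


for seed, flat, tree in compare(budget=30, seeds=range(5)):
    print(f"seed={seed} flat={flat:.4f} reuse={tree:.4f}")
```

The point of the sketch is the audit of evaluation counts, not the toy scores: any gap between the two arms on a real benchmark would only be attributable to reuse if the budgets match exactly, which is what the referee asks the authors to demonstrate.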
minor comments (1)
- [Abstract] Abstract: quantitative margins, exact evaluation protocols, and statistical significance are omitted, making it difficult for readers to gauge the practical magnitude of the claimed improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree that revisions are warranted and outlining the changes we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Experiments] Experiments section: no ablation is reported that holds total LLM generations and evaluations fixed while removing the tree-search reuse structure (i.e., a flat baseline of independent samples). This directly tests the load-bearing claim that the tree-search framing, rather than simply more compute, is responsible for the reported gains.
Authors: We agree that this controlled ablation would directly test whether the tree-search structure with reuse and refinement provides benefits beyond simply allocating additional independent LLM generations and evaluations. Our existing evaluations compare AIDE against other agent baselines on the benchmarks, but we did not include a flat-sampling control that exactly matches total compute. In the revised manuscript we will add this ablation on at least one benchmark (e.g., a Kaggle task or a subset of MLE-Bench), holding the total number of LLM calls and code evaluations fixed while comparing the full tree-search procedure against independent sampling without reuse.
Revision: yes
- Referee: [Method] Method section: the tree-search procedure is parameterized by several free hyperparameters whose values are not subjected to sensitivity analysis or ablation; without this, it remains unclear whether the reported SOTA results generalize or depend on benchmark-specific tuning of the search policy.
Authors: The tree-search procedure uses several hyperparameters (branching factor, selection threshold for promising nodes, and maximum depth). These were selected during initial development on a small development set and then held fixed across all three benchmark suites to demonstrate that the same policy works without per-benchmark retuning. We acknowledge that a formal sensitivity analysis would further support robustness claims. In the revision we will add a sensitivity study for the primary hyperparameters, reporting performance variation on a representative task from one of the benchmarks.
Revision: yes
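A sensitivity study of the kind promised in the second response could take the following shape. This is a sketch under assumptions: the three hyperparameter names come from the simulated rebuttal above rather than from the paper itself, `bounded_search` is a hypothetical toy search over floats, and the grid values are arbitrary. It reports the mean and spread of the best score across seeds for each configuration.

```python
import itertools
import random
import statistics
from typing import Dict, Iterable, List, Tuple


def bounded_search(
    branching: int, max_depth: int, threshold: float, seed: int, budget: int = 60
) -> float:
    """Toy breadth-first search limited by branching, depth, and a score threshold."""
    rng = random.Random(seed)
    score = lambda x: -(x - 3.0) ** 2  # toy objective, higher is better
    root = rng.uniform(-10, 10)
    frontier: List[Tuple[float, int]] = [(root, 0)]
    best = score(root)
    used = 1  # evaluations spent so far
    while frontier and used < budget:
        x, depth = frontier.pop(0)
        if depth >= max_depth:
            continue  # nodes at the depth limit are not expanded
        for _ in range(branching):
            if used >= budget:
                break
            child = x + rng.gauss(0, 1.0)
            s = score(child)
            used += 1
            best = max(best, s)
            # Only children close enough to the best score stay "promising".
            if s >= best - threshold:
                frontier.append((child, depth + 1))
    return best


def sensitivity(
    grid: Dict[str, list], seeds: Iterable[int]
) -> Dict[Tuple[int, int, float], Tuple[float, float]]:
    """Map each configuration to (mean, population std) of best score over seeds."""
    seeds = list(seeds)
    results = {}
    for config in itertools.product(
        grid["branching"], grid["max_depth"], grid["threshold"]
    ):
        scores = [bounded_search(*config, seed=s) for s in seeds]
        results[config] = (statistics.mean(scores), statistics.pstdev(scores))
    return results


grid = {"branching": [1, 2, 4], "max_depth": [2, 5], "threshold": [0.5, 5.0]}
table = sensitivity(grid, seeds=range(5))
for config, (mean, spread) in sorted(table.items()):
    print(config, f"mean={mean:.3f}", f"std={spread:.3f}")
```

If the means vary little relative to the per-configuration spread, that would support the claim that results do not hinge on tuning; a large variation would suggest the opposite. The toy numbers here carry no evidential weight for AIDE itself.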
Circularity Check
No circularity: empirical results on external benchmarks
The paper introduces AIDE as an LLM-guided tree search system for ML code optimization and reports SOTA performance on independent external benchmarks (Kaggle evaluations, OpenAI MLE-Bench, METR RE-Bench). No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on measured outcomes from separate test suites rather than any quantity being defined in terms of itself or forced by internal construction. Self-citations, if present, are not load-bearing for the central empirical result.
Axiom & Free-Parameter Ledger
free parameters (1)
- tree search hyperparameters
axioms (1)
- domain assumption: Large language models can generate and evaluate useful code modifications for machine learning tasks
Lean theorems connected to this paper
- Cost.FunctionalEquation / DAlembert.InevitabilityRCL_is_unique_functional_form_of_logic (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: "achieving state-of-the-art results on multiple machine learning engineering benchmarks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
-
DataMaster: Data-Centric Autonomous AI Research
DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...
-
DataMaster: Data-Centric Autonomous AI Research
DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.
-
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...
-
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents
SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage poin...
-
TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling
TrafficClaw creates a single runtime environment for heterogeneous urban traffic subsystems and deploys an LLM agent with spatiotemporal reasoning to deliver robust control that generalizes across unseen scenarios.
-
AgentGA: Evolving Code Solutions in Agent-Seed Space
AgentGA uses a genetic algorithm to evolve agent seeds and achieves 74.52% human-exceeding performance on tabular AutoML tasks versus 54.15% for the AIDE baseline.
-
AgentGA: Evolving Code Solutions in Agent-Seed Space
AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
Toward Autonomous Long-Horizon Engineering for ML Research
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
AIRA_2: Overcoming Bottlenecks in AI Research Agents
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...
-
A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
-
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
-
GEAR: Genetic AutoResearch for Agentic Code Evolution
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
-
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.