hub

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko · 2025 · cs.AI · arXiv 2502.13138

29 Pith papers cite this work. Polarity classification is still indexing.

29 Pith papers citing it

open full Pith review browse 29 citing papers arXiv PDF

abstract

Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind advancements lies a complex and often tedious process requiring labor and compute intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench and METRs RE-Bench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

cs.AI · 2026-05-17 · unverdicted · novelty 7.0

MEMOIR adds branch-local and global memory with a reflection step to tree search for LLM solver synthesis, reaching 96.7% solution validity and 7.3-point score gains over baselines on seven CO problems with lower run-to-run variance.

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

cs.LG · 2025-09-25 · unverdicted · novelty 7.0

Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

cs.LG · 2026-05-15 · conditional · novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin

DrugSAGE:Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

AutoLLMResearch trains agents in a multi-fidelity LLMConfig-Gym environment formulated as a long-horizon MDP to enable cross-fidelity extrapolation for automating high-cost LLM experiment configurations.

DataMaster: Data-Centric Autonomous AI Research

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GPQA gain over the base instruct model.

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.

SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage points across equity sectors.

AgentGA: Evolving Code Solutions in Agent-Seed Space

cs.AI · 2026-04-16 · unverdicted · novelty 6.0 · 2 refs

AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.

AIBuildAI: An AI Agent for Automatically Building AI Models

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.

Pioneer Agent: Continual Improvement of Small Language Models in Production

cs.AI · 2026-04-10 · unverdicted · novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.

AIRA_2: Overcoming Bottlenecks in AI Research Agents

cs.AI · 2026-03-27 · conditional · novelty 6.0

AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.

A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems

cs.AI · 2026-03-16 · unverdicted · novelty 6.0

SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

cs.LG · 2026-03-02 · unverdicted · novelty 6.0

Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

An AI system to help scientists write expert-level empirical software

cs.AI · 2025-09-08 · unverdicted · novelty 6.0 · 2 refs

ERA combines LLMs and tree search to produce expert-level empirical software that outperforms top human methods on single-cell analysis leaderboards and CDC COVID-19 forecasts.

GEAR: Genetic AutoResearch for Agentic Code Evolution

cs.NE · 2026-05-08 · unverdicted · novelty 5.0

GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

cs.LG · 2026-02-08 · unverdicted · novelty 5.0

AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.

TusoAI: Agentic Optimization for Scientific Methods

cs.AI · 2025-09-28 · unverdicted · novelty 5.0

TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while reporting new genetic associations.

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

cs.AI · 2026-05-20 · unverdicted · novelty 4.0

SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

citing papers explorer

Showing 29 of 29 citing papers.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents cs.SE · 2026-05-20 · unverdicted · none · ref 26 · internal anchor
SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.
What Do Evolutionary Coding Agents Evolve? cs.NE · 2026-05-19 · unverdicted · none · ref 16 · internal anchor
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis cs.AI · 2026-05-17 · unverdicted · none · ref 8 · internal anchor
MEMOIR adds branch-local and global memory with a reflection step to tree search for LLM solver synthesis, reaching 96.7% solution validity and 7.3-point score gains over baselines on seven CO problems with lower run-to-run variance.
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences? cs.AI · 2026-04-12 · unverdicted · none · ref 21 · internal anchor
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data cs.LG · 2025-09-25 · unverdicted · none · ref 26 · internal anchor
Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility cs.LG · 2026-05-15 · conditional · none · ref 18 · internal anchor
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin
DrugSAGE:Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery cs.LG · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive cs.AI · 2026-05-12 · unverdicted · none · ref 19 · 2 links · internal anchor
AutoLLMResearch trains agents in a multi-fidelity LLMConfig-Gym environment formulated as a long-horizon MDP to enable cross-fidelity extrapolation for automating high-cost LLM experiment configurations.
DataMaster: Data-Centric Autonomous AI Research cs.LG · 2026-05-11 · unverdicted · none · ref 16 · 2 links · internal anchor
DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GPQA gain over the base instruct model.
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models cs.LG · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents cs.LG · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage points across equity sectors.
AgentGA: Evolving Code Solutions in Agent-Seed Space cs.AI · 2026-04-16 · unverdicted · none · ref 13 · 2 links · internal anchor
AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.
AIBuildAI: An AI Agent for Automatically Building AI Models cs.AI · 2026-04-15 · unverdicted · none · ref 31 · internal anchor
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 44 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
AIRA_2: Overcoming Bottlenecks in AI Research Agents cs.AI · 2026-03-27 · conditional · none · ref 10 · internal anchor
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems cs.AI · 2026-03-16 · unverdicted · none · ref 21 · internal anchor
SEPDD is a self-evolving defect detection framework for PV modules that achieves 91.4% mAP50 on public data and 49.5% on private data, outperforming autonomous baselines and human experts.
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search cs.LG · 2026-03-02 · unverdicted · none · ref 15 · internal anchor
Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 223 · internal anchor
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
An AI system to help scientists write expert-level empirical software cs.AI · 2025-09-08 · unverdicted · none · ref 14 · 2 links · internal anchor
ERA combines LLMs and tree search to produce expert-level empirical software that outperforms top human methods on single-cell analysis leaderboards and CDC COVID-19 forecasts.
GEAR: Genetic AutoResearch for Agentic Code Evolution cs.NE · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering cs.LG · 2026-02-08 · unverdicted · none · ref 6 · internal anchor
AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.
TusoAI: Agentic Optimization for Scientific Methods cs.AI · 2025-09-28 · unverdicted · none · ref 17 · internal anchor
TusoAI is an LLM-based agent that builds and iteratively optimizes domain-specific computational methods for scientific data analysis, outperforming expert baselines on RNA-seq denoising and earth monitoring while reporting new genetic associations.
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research cs.AI · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.
AI for Auto-Research: Roadmap & User Guide cs.AI · 2026-05-18 · unverdicted · none · ref 81 · internal anchor
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines cs.AI · 2026-02-13 · unverdicted · none · ref 7 · internal anchor
NeuroWeaver reformulates EEG pipeline design as constrained evolutionary optimization with domain-informed initialization, yielding lightweight pipelines that outperform task-specific methods and match foundation models on five benchmarks.
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration cs.AI · 2026-05-19 · unreviewed · ref 9 · internal anchor
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics cs.LG · 2026-05-17 · unreviewed · ref 21 · internal anchor
TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control cs.AI · 2026-04-19 · unreviewed · ref 14 · internal anchor
Toward Autonomous Long-Horizon Engineering for ML Research cs.CL · 2026-04-14 · unreviewed · ref 6 · internal anchor

AIDE: AI-Driven Exploration in the Space of Code

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer