super hub Canonical reference

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, David Ha, Jakob Foerster, Jeff Clune, Robert Tjarko Lange · 2024 · cs.AI · arXiv 2408.06292

Canonical reference. 75% of citing Pith papers cite this work as background.

220 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 220 citing papers more from Chris Lu arXiv PDF

abstract

One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 31 baseline 2 dataset 1 method 1 other 1

citation-polarity summary

background 27 unclear 3 baseline 2 support 2 use dataset 1 use method 1

claims ledger

abstract One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which

authors

Chris Lu Cong Lu David Ha Jakob Foerster Jeff Clune Robert Tjarko Lange

co-cited works

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 8.0 · 3 refs

MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

The Last Human-Written Paper: Agent-Native Research Artifacts

cs.LG · 2026-04-27 · unverdicted · novelty 8.0

Introduces ARA as a four-layer machine-executable research package and reports benchmark gains in agent QA accuracy and reproduction success.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

Measuring the Gap Between Human and LLM Research Ideas

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.

FARS: A Fully Automated Research System Deployed at Scale

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.

AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution

cs.AI · 2026-06-28 · reject · novelty 7.0

Multi-agent LLM system Agora under Sealed Joint Search conditions produces +1.87 holdout Sharpe on CSI 1000 over a 91-day sealed period, exceeding the best baseline at +1.334 under favorable seed.

AICID: Unique Identifiers for AI Scientists

cs.DL · 2026-06-27 · unverdicted · novelty 7.0

The paper proposes AICID as a new identifier system to make the provenance of AI-generated scholarly work transparent and machine-readable.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

A hardware-safety-gated system for LLM-written native ARTIQ control code on a trapped-ion platform

quant-ph · 2026-06-25 · unverdicted · novelty 7.0

A token-based authorization system with simulation and human gates enables safe LLM-written ARTIQ control code execution on trapped-ion platforms while blocking unauthorized hardware access.

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly by translating problems into familiar supervised learning setups.

PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation

cs.CR · 2026-06-23 · unverdicted · novelty 7.0

PixJail automates construction of paper-specific attack modules and unified evaluation pipelines for text-to-image jailbreaks, reproducing eleven methods with 2.1% average and 0% median error.

Agentic AutoResearch forSpace Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

An LLM-driven agent with built-in seed-noise audits develops control policies for two aerospace problems that outperform undirected search and pass verification checks.

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ENPIRE supplies four modules (Environment, Policy Improvement, Rollout, Evolution) that turn real-world robot training into an autonomous optimization loop driven by coding agents.

Can In-Context Learning Support Intrinsic Curiosity?

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

ICL-derived intrinsic rewards are biased in general MDPs but asymptotically match true learning progress in non-temporal settings, with supporting experiments.

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

physics.comp-ph · 2026-06-17 · unverdicted · novelty 7.0

PhySciBench benchmark shows current AI models achieve at most 33.5% accuracy on physical science tasks; DelveAgent framework improves accuracy by up to 7.5 points and cuts costs to one-third.

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

cs.CR · 2026-06-16 · accept · novelty 7.0

SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

Representing Time Series as Structured Programs for LLM Reasoning

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

T2SP converts time series into structured programs for trends, periods, and events, enabling off-the-shelf LLMs to perform better on editing, captioning, and QA tasks than raw string inputs.

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

LLM agents match or exceed human methodological diversity and produce aligned effect estimates, yet flip final verdicts from 10% to 90% support under a confirmatory prompt while leaving coefficients unchanged.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer