Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · 2025 · cs.AI · arXiv 2505.22954

Canonical reference. 85% of citing Pith papers cite this work as background.

32 Pith papers citing it

abstract

Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
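The archive loop described in the abstract (sample an agent, have a foundation model propose a modified version, validate it empirically, add it to the archive) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Agent`, `run_archive_loop`, `propose`, and `evaluate` are hypothetical, the uniform parent sampling and the keep-every-child rule are simplifying assumptions, and the toy `propose`/`evaluate` functions merely stand in for a foundation-model call and a coding benchmark.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Agent:
    code: str                     # stand-in for the agent's own codebase
    score: float                  # result of empirical validation
    parent: Optional[int] = None  # archive index of the parent (tree edge)


def run_archive_loop(
    seed_code: str,
    propose: Callable[[str, random.Random], str],  # hypothetical self-modification step
    evaluate: Callable[[str], float],              # hypothetical benchmark score
    iterations: int,
    rng: random.Random,
) -> List[Agent]:
    """Grow an archive of agents by repeated sample -> modify -> evaluate."""
    archive = [Agent(seed_code, evaluate(seed_code))]
    for _ in range(iterations):
        # Assumption: parents are sampled uniformly; the abstract does not
        # specify the sampling rule.
        parent_idx = rng.randrange(len(archive))
        child_code = propose(archive[parent_idx].code, rng)
        child_score = evaluate(child_code)
        # Assumption: every child is kept, so the archive forms a growing
        # tree and lower-scoring agents can still act as stepping stones.
        archive.append(Agent(child_code, child_score, parent_idx))
    return archive


# Toy stand-ins so the sketch runs without a model or a benchmark.
def toy_propose(code: str, rng: random.Random) -> str:
    return code + rng.choice("ab")


def toy_evaluate(code: str) -> float:
    return code.count("a") / max(len(code), 1)


demo_archive = run_archive_loop("b", toy_propose, toy_evaluate, 20, random.Random(0))
best = max(demo_archive, key=lambda agent: agent.score)
print(len(demo_archive), round(best.score, 3))
```

Recording a `parent` index for each child is what makes the archive a tree rather than a flat population, so several lineages can be extended in parallel instead of only the current best agent.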

citation-role summary

background 11 · dataset 1 · other 1

citation-polarity summary

background 11 · unclear 1 · use dataset 1

representative citing papers

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

cs.CR · 2026-05-25 · unverdicted · novelty 7.0

CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

BIM Information Extraction Through LLM-based Adaptive Exploration

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

Optimizing ground state preparation protocols with autoresearch

quant-ph · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

AI coding agents evolve simple ground-state protocols into improved versions for VQE, DMRG, and AFQMC on spin models and molecules by using executable energy scores under fixed compute budgets.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Analytic Concept-Centric Memory for Agentic Embodied Manipulation

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

Proposes a structured concept-centric memory system for embodied agents that connects object, scene, transition, and skill memories to support coarse-to-fine retrieval and improve task performance over baselines.

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

AFTER benchmark shows single refinement improves LLM agent performance by 3.7-6.7 points and multi-model procedural skills reach 73.1% cross-model accuracy on 382 tasks.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

MemPro: Agentic Memory Systems as Evolvable Programs

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

MemPro evolves the entire MCR pipeline as runnable programs via failure-guided refinement on a version tree and outperforms static baselines on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

Open-Ended Task Discovery via Bayesian Optimization

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.

AgentGA: Evolving Code Solutions in Agent-Seed Space

cs.AI · 2026-04-16 · unverdicted · novelty 6.0 · 2 refs

AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.

AI-Driven Research for Databases

cs.DB · 2026-04-08 · unverdicted · novelty 6.0

Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

Self-Optimizing Multi-Agent Systems for Deep Research

cs.IR · 2026-04-03 · unverdicted · novelty 6.0

Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.

Memory in the Age of AI Agents

cs.CL · 2025-12-15 · unverdicted · novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

Differentiable Evolutionary Reinforcement Learning

cs.AI · 2025-12-15 · unverdicted · novelty 6.0

DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

Evolutionary Ensemble of Agents

cs.NE · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.

Disposition Distillation at Small Scale: A Three-Arc Negative Result

cs.LG · 2026-04-13 · accept · novelty 5.0

Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

cs.SE · 2026-02-08 · unverdicted · novelty 5.0

Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

cs.SE · 2026-04-29 · unverdicted · none · ref 37 · internal anchor

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · none · ref 64 · internal anchor

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

cs.SE · 2026-02-08 · unverdicted · none · ref 61 · internal anchor

Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.

Effective Harness Engineering for Algorithm Discovery with Coding Agents

cs.SE · 2026-05-13 · unverdicted · none · ref 13 · internal anchor

Under fixed token budget on Circle Packing, deeper per-candidate reasoning beats generating more shallow candidates, and capable models produce evaluation hacks at higher rates.

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

cs.SE · 2025-12-21 · unreviewed · ref 56 · internal anchor