pith. sign in

arxiv: 2606.26294 · v1 · pith:TT5KXQQUnew · submitted 2026-06-24 · 💻 cs.LG · cs.AI· cs.MA· cs.NE

The Red Queen G\"odel Machine: Co-Evolving Agents and Their Evaluators

Pith reviewed 2026-06-26 01:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.MAcs.NE
keywords self-improving agentsco-evolutionRed Queen Gödel Machinenon-stationary utilitiesagent-as-judgerecursive self-improvementevolutionary searchadversarial evaluation
0
0 comments X

The pith

The Red Queen Gödel Machine organizes self-improvement into epochs so agents and evaluators can co-evolve as utilities change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-improving agents have relied on fixed verifiers or benchmarks, but real evolution involves changing environments. The paper proposes the Red Queen Gödel Machine to handle non-stationary utilities by running search in epochs under a fixed within-epoch criterion while allowing the utility to update at boundaries. This structure preserves per-epoch self-improvement guarantees while opening the loop to co-evolving writers, graders, and adversarial signals. On coding tasks the method adds an agent-as-judge code-review signal that raises test pass rate and cuts token use; on paper writing and reviewing, co-evolved writers reach 1.78x–1.86x higher acceptance under diverse panels and co-evolved graders reach 9 percent higher ground-truth accuracy; an adversarial objective also reduces over-acceptance of AI-generated papers.

Core claim

The RQGM makes recursive self-improvement possible under non-stationary utilities by dividing search into epochs that keep a fixed evaluation criterion inside each epoch while permitting the utility function to evolve at epoch boundaries, so that self-improvement guarantees continue to hold locally as the global objective changes.

What carries the argument

Epoch structure with controlled utility evolution: within each epoch the evaluation criterion is held fixed so local search guarantees apply, while the utility can be updated at boundaries to reflect co-evolution of agents and evaluators.

If this is right

  • Adding an agent-as-a-judge code-review signal raises test pass rate on verifiable coding tasks while using 1.35x–1.72x fewer tokens than prior methods.
  • Co-evolved writers achieve 1.78x–1.86x higher acceptance rates when evaluated by a diverse panel of agent judges.
  • Co-evolved graders reach 9 percent higher accuracy against ground-truth labels on Olympiad-level proofs.
  • An adversarial objective can reduce over-acceptance of AI-generated papers to match the rate applied to human work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The epoch boundary mechanism may generalize to any setting in which short-term fixed criteria must remain stable while longer-term objectives drift.
  • Treating evaluation as a co-evolving partner rather than a static oracle could apply to open-ended domains beyond coding and paper review.
  • The adversarial signal that equalizes stringency between AI and human output suggests a route to debiasing automated reviewers without external human labels.

Load-bearing premise

Self-improvement guarantees remain valid inside an epoch even after the utility has been allowed to change at the previous boundary, and the fixed within-epoch criterion continues to serve as a useful proxy for the evolving objective.

What would settle it

An experiment in which an agent trained inside one epoch shows no further improvement or violates the expected performance scaling once the utility is updated at the next boundary, or in which the within-epoch metric no longer correlates with outcomes measured under the updated utility.

Figures

Figures reproduced from arXiv: 2606.26294 by Alex Iacob, Ambroise Odonnat, Andrej Jovanovi\'c, Bill Marino, Daniel Burkhardt, Lorenzo Sani, Meghdad Kurmanji, Niccol\`o Alberto Elia Venanzi, Nicholas D. Lane, Nurbek Tastan, William F. Shen, Xinchi Qiu, Zeyu Cao.

Figure 1
Figure 1. Figure 1: RQGM exceeds the prior SOTA HGM-H on Polyglot with 1.35×–1.72× fewer search tokens by adding a cheap evolved code reviewer. Left: held-out pass rate vs. search cost; arrows give the tokens RQGM saves to exceed the baseline’s rate, and the heart marks the run’s best agent. Right: best-belief utility during search. At each evaluator replacement (crowned dashed rule), the utility drops as selective erasure di… view at source ↗
Figure 2
Figure 2. Figure 2: RQGM searches over a multi-agent workspace tree containing both learnable task agents and evaluators. At each step, a node is selected by Thompson sampling over clade metapro￾ductivity and either expanded by a meta-agent or evaluated. Evaluators are scored against a ground￾truth anchor and task agents by their epoch-local frozen evaluator or a fixed benchmark for evaluator￾independent roles. At checkpoints… view at source ↗
Figure 3
Figure 3. Figure 3: Evaluator replacements permanently re-rank the archive. Each curve tracks one re￾placement, plotting the Spearman 𝜌 between post- and pre-replacement rankings as evaluations ac￾cumulate: 𝜌 = 1 (dotted) is the unchanged order, 𝜌 = 0 an uncorrelated reordering. Across all three tasks 𝜌 settles well below 1 and never recovers, so the new ordering holds. The no-erasure control (rightmost) stays high (𝜌 ≥ 0.90)… view at source ↗
Figure 4
Figure 4. Figure 4: Evaluator replacement preserves the best lineage while re-ranking the remainder. The paper-run archive after an evaluator replacement (radial layout; node color is best-belief utility, size tracks evidence count). The crimson winning lineage survives intact, ending in the crowned-heart winner. Among the top-8 nodes, every prior member is re-ranked (rings mark promotions and churn). Macro view: each replace… view at source ↗
Figure 5
Figure 5. Figure 5: The co-evolved RQGM grader reaches the best IMO-GradingBench accuracy at 3× lower search cost than HGM-H. A ground-truth-anchored slot has one global best-belief winner (the crowned heart), while an evaluator-dependent slot admits only epoch-local winners, scored post￾hoc in the tables. Left: IMO-GradingBench accuracy of selected graders against grading mean absolute error (MAE); the RQGM grader’s global w… view at source ↗
Figure 6
Figure 6. Figure 6: The adversarial RQGM reviewer accepts AI and human papers at similar rates, a cali￾brated accept/reject boundary that drives the strongest writer (Tab. 1); HGM-H reaches higher raw APReS accuracy only by over-accepting AI-generated papers, which leaves its writer weak. Left: APReS accuracy of selected reviewers against acceptance rate; the dashed line marks the dataset’s true accept rate. The RQGM reviewer… view at source ↗
Figure 7
Figure 7. Figure 7: Each domain separates durable anchor evidence from epoch-local evaluator records. Left: fixed ground-truth anchors (solid lines) survive evaluator replacement, while epoch-frozen evaluator-slot records (dashed lines) are local to epoch 𝑗𝑚. Middle: only evaluator-independent an￾chor evidence enters the 𝜖-best-belief evaluator replacement gate. Right: after replacement, anchors remain valid while records fro… view at source ↗
Figure 8
Figure 8. Figure 8: Blended-token cost decomposition of the three headline GPT-5.5 (low) runs into workspace expansion, train-time evaluation, and validation evaluation. D.3.2 Patch surfaces We classify every lineage edge’s patch by the code surface it modifies. Shared surfaces, the task￾agent code and infrastructure used by more than one role, carry 59–90% of accepted patches in every GPT-5.5 run ( [PITH_FULL_IMAGE:figures/… view at source ↗
Figure 9
Figure 9. Figure 9: Patch surfaces by run, classifying each accepted lineage edge by the code surface it mod￾ifies (task-agent shared, infrastructure, role-specific, meta-agent module, or notes). E What the Co-Evolution Discovered Every evaluator role is a prompt entry in one shared workspace file, so each evaluator’s behavioral history reads directly from its lineage. The initial role prompts are minimal (App. C.5); across e… view at source ↗
Figure 10
Figure 10. Figure 10: What each evaluator replacement keeps versus what it re-ranks, in all three headline domains. Rows are domains (top to bottom: Polyglot, paper, proof); columns are that run’s shown evaluator replacements, each drawing the archive as it stood at that replacement (radial layout, node color is best-belief utility and size is evidence count). Among the top-8 leaderboard nodes at each replacement, navy rings s… view at source ↗
read the original abstract

Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled dataset that remains valid as the agent improves. This ignores a central feature of evolution: species adapt as their environments change with them. We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks. We introduce the Red Queen Godel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities. The RQGM makes this possible through controlled utility evolution: search is organized into epochs with a fixed within-epoch evaluation criterion, while the utility can be updated at epoch boundaries, so self-improvement guarantees hold per epoch as the objective evolves across them. We begin by showing that even on verifiable coding tasks, the RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens. We then turn to scientific paper writing and reviewing, and Olympiad-level proof writing and grading, where the RQGM improves performance over prior self-improving agents: co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy. In paper reviewing, the strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement of agents under non-stationary utilities. Search is partitioned into epochs with a fixed within-epoch evaluation criterion so that self-improvement guarantees hold per epoch, while the utility function can be updated at epoch boundaries. Empirical results are reported on verifiable coding tasks (improved test pass rate and 1.35x-1.72x fewer tokens via an agent-as-a-judge code-review signal), scientific paper writing/reviewing (co-evolved writers achieve 1.78x-1.86x higher acceptance rates under diverse agent judges; co-evolved graders reach 9% higher ground-truth accuracy; an adversarial objective corrects over-acceptance of AI-generated papers), and Olympiad-level proof writing/grading.

Significance. If the per-epoch construction is shown to preserve useful progress under evolving utilities, the framework would address a fundamental limitation of current self-improving agents that assume stationary verifiers or benchmarks. The multi-domain empirical gains, token efficiency, and explicit handling of adversarial objectives (e.g., fair reviewing) would be notable contributions to agentic and evolutionary ML methods.

major comments (2)
  1. [Abstract and §3 (controlled utility evolution / epoch partitioning)] Abstract and §3 (controlled utility evolution / epoch partitioning): the central claim that 'self-improvement guarantees hold per epoch as the objective evolves across them' is load-bearing for the non-stationary utility handling, yet the manuscript supplies no formal argument, theorem, or transfer proof showing that improvements obtained under fixed criterion C_t remain informative or non-harmful once the utility is replaced by C_{t+1} at the epoch boundary. This directly matches the stress-test concern and leaves open the possibility that the co-evolution loop amplifies misalignment.
  2. [Empirical evaluation sections (coding, paper, and proof tasks)] Empirical evaluation sections (coding, paper, and proof tasks): all quantitative claims (1.78x-1.86x acceptance rates, 9% accuracy gain, 1.35x-1.72x token reduction, 1.91x over-acceptance correction) are presented without error bars, number of independent runs, or statistical tests, so the reliability of the reported improvements over baselines cannot be assessed.
minor comments (2)
  1. [Methods] Methods: the description of how the utility update rule itself is learned or constrained at epoch boundaries is underspecified; a concrete algorithm or pseudocode would clarify the mechanism.
  2. [Notation] Notation: the distinction between the within-epoch fixed criterion and the evolving utility should be given explicit symbols (e.g., C_t vs. U_t) to avoid ambiguity when discussing transfer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We respond to each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §3 (controlled utility evolution / epoch partitioning)] Abstract and §3 (controlled utility evolution / epoch partitioning): the central claim that 'self-improvement guarantees hold per epoch as the objective evolves across them' is load-bearing for the non-stationary utility handling, yet the manuscript supplies no formal argument, theorem, or transfer proof showing that improvements obtained under fixed criterion C_t remain informative or non-harmful once the utility is replaced by C_{t+1} at the epoch boundary. This directly matches the stress-test concern and leaves open the possibility that the co-evolution loop amplifies misalignment.

    Authors: The per-epoch construction fixes the evaluation criterion for the duration of search within an epoch, allowing any self-improvement procedure that assumes a stationary objective to be invoked without modification. At epoch boundaries the utility is updated in a controlled manner and the subsequent epoch begins from the agents obtained under the prior criterion. We agree that the manuscript would benefit from an explicit statement of this invariance and will add a short clarifying proposition in §3 that formalizes the per-epoch stationarity and the conditions under which intra-epoch improvements are carried forward. revision: yes

  2. Referee: [Empirical evaluation sections (coding, paper, and proof tasks)] Empirical evaluation sections (coding, paper, and proof tasks): all quantitative claims (1.78x-1.86x acceptance rates, 9% accuracy gain, 1.35x-1.72x token reduction, 1.91x over-acceptance correction) are presented without error bars, number of independent runs, or statistical tests, so the reliability of the reported improvements over baselines cannot be assessed.

    Authors: The current manuscript indeed omits these statistical details. In the revision we will report the number of independent runs performed for each experiment, include error bars or standard deviations on all quantitative metrics, and add appropriate statistical tests comparing RQGM variants against baselines. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical outcomes reported without definitional reduction

full rationale

The paper describes an evolutionary framework (RQGM) that partitions search into epochs with fixed within-epoch criteria while allowing utility updates at boundaries. All reported results—test pass rates, acceptance rates, token usage, and accuracy improvements—are presented as measured empirical outcomes on coding, paper-writing, and proof tasks rather than as quantities defined in terms of the framework itself or obtained by fitting parameters that are then renamed as predictions. No equations, self-definitional constructions, fitted-input predictions, or load-bearing self-citation chains appear in the abstract or description; the central claims remain independent of the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated premise that per-epoch fixed-evaluation guarantees remain valid when the utility is permitted to change at boundaries; no free parameters, axioms, or invented entities are enumerated in the abstract.

axioms (1)
  • domain assumption Self-improvement guarantees hold inside each epoch under a fixed evaluation criterion even as the global utility evolves across epochs.
    This premise is required for the method to inherit the safety properties of prior stationary self-improvement while allowing non-stationary utilities.

pith-pipeline@v0.9.1-grok · 5939 in / 1329 out tokens · 13867 ms · 2026-06-26T01:42:26.014748+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 3 canonical work pages

  1. [1]

    A new evolutionary law

    Leigh Van Valen. A new evolutionary law. Evolutionary Theory, 1:1–30, 1973

  2. [2]

    Darwin Gödel Machine: Open-ended evolution of self-improving agents

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, and Jeff Clune. Darwin Gödel Machine: Open-ended evolution of self-improving agents. In ICLR, 2026. URL https:// openreview.net/forum?id=pUpzQZTvGY

  3. [3]

    Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

    Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jürgen Schmidhuber. Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine. In ICLR,

  4. [4]

    URL https://openreview.net/forum?id=T0EiEuhOOL

  5. [5]

    Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina

    Jenny Zhang, Bingchen Zhao, Wannan Y ang, Jakob N. Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. HyperAgents. CoRR, abs/2603.19461, 2026

  6. [6]

    Foerster, Jeff Clune, and David Ha

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. CoRR, abs/2408.06292, 2024

  7. [7]

    Foerster, Jeff Clune, and David Ha

    Yutaro Y amada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search. CoRR, abs/2504.08066, 2025

  8. [8]

    Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Hen- ryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V . Le, and Junehyuk Jung. Towards robust mathematical reasoning. In EMNLP, pages 35418–354...

  9. [9]

    Learning hand- eye coordination for robotic grasping with deep learning and large-scale data collection

    Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand- eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robotics Res., 37(4-5):421–436, 2018

  10. [10]

    Nature , author =

    Benjamin Burger, Phillip M. Maffettone, Vladimir V . Gusev, Catherine M. Aitchison, Y ang Bai, Xiaoyan Wang, Xiaobo Li, Ben M. Alston, Buyi Li, Rob Clowes, Nicola Rankin, Brandon Harris, Reiner Sebastian Sprick, and Andrew I. Cooper. A mobile robotic chemist. Nature, 583 (7815):237–241, 2020. doi: 10.1038/s41586-020-2442-2

  11. [11]

    Hind- sight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welin- der, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hind- sight experience replay. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fer- gus, S. Vishwanathan, and R. Garnett, editors, NeurIPS, volume 30. Curran Associates, Inc.,

  12. [12]

    URL https://proceedings.neurips.cc/paper_files/paper/ 2017/file/453fadbd8a1a3af50a9df4df899537b5-Paper.pdf

  13. [13]

    Self-improvement of large language models: A technical overview and future outlook

    Haoyan Y ang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng, Sai Akhil Kogilathota, and Jiawei Zhou. Self-improvement of large language models: A technical overview and future outlook. CoRR, abs/2603.25681, 2026

  14. [14]

    Map- ping global dynamics of benchmark creation and saturation in artificial intelligence

    Adriano Barbosa-Silva, Simon Ott, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Map- ping global dynamics of benchmark creation and saturation in artificial intelligence. CoRR, abs/2203.04592, 2022

  15. [15]

    o1 tops aider’s new polyglot leaderboard.https://aider.chat/2024/ 12/21/polyglot.html, December 2024

    Paul Gauthier. o1 tops aider’s new polyglot leaderboard.https://aider.chat/2024/ 12/21/polyglot.html, December 2024. Accessed: 2026-01-28. 13

  16. [16]

    Towards end-to-end automation of AI research

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Y amada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of AI research. Nature, 651(8107):914– 919, 2026

  17. [17]

    APReS: an agentic paper revision and evaluation system

    Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman, Ab- hishek Charnalia, Despoina Magka, Tatiana Shavrina, Derek Dunfield, Oisin Mac Aodha, and Y oram Bachrach. APReS: an agentic paper revision and evaluation system. CoRR, abs/2603.03142, 2026

  18. [18]

    Bowman, and Shi Feng

    Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. In NeurIPS, 2024

  19. [19]

    Speculations concerning the first ultraintelligent machine

    Irving John Good. Speculations concerning the first ultraintelligent machine. Adv. Comput., 6: 31–88, 1965

  20. [20]

    Gödel machines: Self-referential universal problem solvers making prov- ably optimal self-improvements

    Jürgen Schmidhuber. Gödel machines: Self-referential universal problem solvers making prov- ably optimal self-improvements. CoRR, cs.LO/0309048, 2003

  21. [21]

    Eliminating meta optimization through self-referential meta learning

    Louis Kirsch and Jürgen Schmidhuber. Eliminating meta optimization through self-referential meta learning. CoRR, abs/2212.14392, 2022

  22. [22]

    Foerster

    Chris Lu, Sebastian Towers, and Jakob N. Foerster. Arbitrary order meta-learning with simple population-based evolution. CoRR, abs/2303.09478, 2023

  23. [23]

    Live-SWE- agent: Can software engineering agents self-evolve on the fly? CoRR, abs/2511.13646, 2025

    Chunqiu Steven Xia, Zhe Wang, Y an Y ang, Yuxiang Wei, and Lingming Zhang. Live-SWE- agent: Can software engineering agents self-evolve on the fly? CoRR, abs/2511.13646, 2025

  24. [24]

    Self-taught optimizer (STOP): recursively self-improving code generation

    Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Tauman Kalai. Self-taught optimizer (STOP): recursively self-improving code generation. CoRR, abs/2310.02304, 2023

  25. [25]

    Gödel agent: A self-referential agent framework for recursively self-improvement

    Xunjian Yin, Xinyi Wang, Liangming Pan, Li Lin, Xiaojun Wan, and William Y ang Wang. Gödel agent: A self-referential agent framework for recursively self-improvement. In ACL (1), pages 27890–27913. Association for Computational Linguistics, 2025

  26. [26]

    Promptbreeder: Self-referential self-improvement via prompt evolution

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rock- täschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In ICML, vol- ume 235 of Proceedings of Machine Learning Research , pages 13481–13544. PMLR / Open- Review.net, 2024

  27. [27]

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. In NeurIPS, 2022

  28. [28]

    Shen, Alex Iacob, Zichen Zhang, Daoheng Wang, Wenyi Wang, Rui Liang, Yulong Zhang, Xinchi Qiu, and Nicholas D

    William F. Shen, Alex Iacob, Zichen Zhang, Daoheng Wang, Wenyi Wang, Rui Liang, Yulong Zhang, Xinchi Qiu, and Nicholas D. Lane. Star analyst: Self-tuning alpha research. SSRN preprint, 2026. Available at SSRN: https://doi.org/10.2139/ssrn.6823940

  29. [29]

    A self-improving coding agent

    Maxime Robeyns, Martin Szummer, and Laurence Aitchison. A self-improving coding agent. CoRR, abs/2504.15228, 2025

  30. [30]

    Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. In NeurIPS, 2022

  31. [31]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Y onghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P . Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In NeurIPS, 2023

  32. [32]

    Length-controlled AlpacaEval: A simple debiasing of automatic evaluators

    Y ann Dubois, Percy Liang, and Tatsunori Hashimoto. Length-controlled AlpacaEval: A simple debiasing of automatic evaluators. In First Conference on Language Modeling , 2024. URL https://openreview.net/forum?id=CybBmzWBX0

  33. [33]

    Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Y angyang Shi, Vikas Chandra, and Jürgen Schmidhuber

    Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Y angyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. In ICML, volume 267 of Proceedings of Machine Learning Research . PMLR / OpenReview.net, 2025. 14

  34. [34]

    Accelerating scientific discovery with Co-Scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Petar Sirkovic, Artiom Myaskovsky, Grzegorz Glowaty, Felix Weissenberger, Alessio Orlandi, Dan Popovici, et al. Accelerating scientific discovery with Co-Scientist. Nature, pages 1–3, 2026

  35. [35]

    A multi-agent system for automating scientific discovery

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J Szostkiewicz, Dmytro Shved, Gavin J Gyimesi, Jon M Laurent, Samantha M Wright, Muhammed T Razzak, et al. A multi-agent system for automating scientific discovery. Na- ture, pages 1–3, 2026

  36. [36]

    Alexander Novikov, Ngân Vu, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for sci- entific and algo...

  37. [37]

    Pawan Kumar, Emilien Dupont, Francisco J

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

  38. [38]

    Rosin and Richard K

    Christopher D. Rosin and Richard K. Belew. New methods for competitive coevolution. Evol. Comput., 5(1):1–29, 1997

  39. [39]

    Multi-agent learning with the success-story algorithm

    Jürgen Schmidhuber and Jieyu Zhao. Multi-agent learning with the success-story algorithm. In ECAI Workshop LDAIS / ICMAS Workshop LIOME, volume 1221 of Lecture Notes in Computer Science, pages 82–93. Springer, 1996

  40. [40]

    Mastering the game of Go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484–489, 2016

  41. [41]

    Dennis J

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018. doi: 10.112...

  42. [42]

    Multi-agent evolve: LLM self-improve through co-evolution

    Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Pat- wary, and Jiaxuan Y ou. Multi-agent evolve: LLM self-improve through co-evolution. CoRR, abs/2510.23595, 2025

  43. [43]

    Group-evolving agents: Open-ended self-improvement via experience sharing

    Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-evolving agents: Open-ended self-improvement via experience sharing. CoRR, abs/2602.04837, 2026

  44. [44]

    Foerster

    Jason Weston and Jakob N. Foerster. AI & human co-improvement for safer co- superintelligence. CoRR, abs/2512.05356, 2025

  45. [45]

    Stanley, Joel Lehman, and Lisa Soros

    Kenneth O. Stanley, Joel Lehman, and Lisa Soros. Open-endedness: The last grand challenge you’ve never heard of. O’Reilly Radar (online arti- cle), December 2017. URL https://www.oreilly.com/radar/ open-endedness-the-last-grand-challenge-youve-never-heard-of/ . Accessed 2026-06-16

  46. [46]

    Dennis, Jack Parker-Holder, Feryal M

    Edward Hughes, Michael D. Dennis, Jack Parker-Holder, Feryal M. P . Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktäschel. Position: Open-endedness is es- sential for artificial superhuman intelligence. In ICML, volume 235 of Proceedings of Machine Learning Research, pages 20597–20616. PMLR / OpenReview.net, 2024

  47. [47]

    Illuminating search spaces by mapping elites

    Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. CoRR, abs/1504.04909, 2015. 15

  48. [48]

    Stanley, and Jeff Clune

    Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. CoRR, abs/1901.10995, 2019

  49. [49]

    Automated design of agentic systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In ICLR. Open- Review.net, 2025

  50. [50]

    Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general arti- ficial intelligence

    Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general arti- ficial intelligence. CoRR, abs/1905.10985, 2019

  51. [51]

    Algorithms for infinitely many-armed bandits

    Yizao Wang, Jean-yves Audibert, and Rémi Munos. Algorithms for infinitely many-armed bandits. In D. Koller, D. Schuurmans, Y . Bengio, and L. Bottou, editors, NeurIPS, volume 21, pages 1729–1736. Curran Associates, Inc., 2008

  52. [52]

    CRA VE: Code Review Agent Verdict Evaluation

    Li Zhang. CRA VE: Code Review Agent Verdict Evaluation. Hugging Face dataset,

  53. [53]

    Code review classification dataset from curated human code reviews

    URL https://huggingface.co/datasets/TuringEnterprises/ CRAVE. Code review classification dataset from curated human code reviews

  54. [54]

    Jimenez, John Y ang, Alexander Wettig, Shunyu Y ao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Y ang, Alexander Wettig, Shunyu Y ao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In ICLR. OpenReview.net, 2024

  55. [55]

    Yichen Huang and Lin F. Y ang. Gemini 2.5 pro capable of winning gold at IMO 2025. CoRR, abs/2507.15855, 2025

  56. [56]

    Api pricing

    OpenAI. Api pricing. https://openai.com/api/pricing/, 2026

  57. [57]

    Anthropic. Pricing. https://platform.claude.com/docs/en/ about-claude/pricing, 2026

  58. [58]

    Our First Proof Submissions

    OpenAI. Our First Proof Submissions. https://openai.com/index/ first-proof-submissions/ , 2025. Gold-medal-level performance at the In- ternational Mathematical Olympiad 2025

  59. [59]

    Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy

    Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy. Early science acceleration experiments with GPT-5. CoRR, abs/2511.16072, 2025

  60. [60]

    Instruction-following evaluation for large language models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911, 2023

  61. [61]

    node_modules

    Fengqing Jiang, Yichen Feng, Yuetai Li, Luyao Niu, Basel Alomair, and Radha Poovendran. BadScientist: Can a research agent write convincing but unsound papers that fool LLM review- ers? CoRR, abs/2510.18003, 2025. 16 Table of Contents A Appendix Overview 17 B The Algorithm in Full 17 C Experimental Setup 18 C.1 Domains and Anchor pairs . . . . . . . . . ....