
arXiv: 2605.20086 · v1 · pith:47Q5SZXC · submitted 2026-05-19 · cs.NE · cs.AI · cs.LG

What Do Evolutionary Coding Agents Evolve?

Pith reviewed 2026-05-20 03:40 UTC · model grok-4.3

classification: cs.NE · cs.AI · cs.LG
keywords: evolutionary search, LLM code generation, edit classification, search traces, algorithm design, benchmark evaluation, code evolution

The pith

Evolutionary coding agents often improve scores by cycling deleted lines back into code rather than inventing new algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Systems that combine LLMs with evolutionary search generate and refine code for math and algorithm tasks. Final benchmark scores can result from new structure, re-tuning, recombination of existing knowledge, or overfitting to the judge. The authors create the EvoTrace dataset of full search traces from four frameworks and sixteen tasks, then apply EvoReplay to reconstruct states and test interventions on successful solutions. Every edit receives one of nine labels from an LLM judge validated by blind human review. Most gains trace to a small subset of edit types, and about thirty percent of added lines are exact re-introductions of lines deleted earlier in the same run.

Core claim

Benchmark gains in evolutionary coding agents arise from qualitatively different mechanisms, only some of which introduce new algorithmic structure; a deterministic cycling pattern appears in which roughly thirty percent of lines added during search are byte-identical re-introductions of previously deleted lines.
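
One way to make the cycling claim concrete is to walk a lineage of programs and count added lines that exactly match a line deleted at an earlier step. The sketch below is an illustration, not the authors' EvoReplay code: the names `line_diff` and `reintroduction_rate` are hypothetical, the lineage is assumed to be an ordered list of program sources, string equality on decoded lines stands in for byte identity, and skipping blank lines is an assumption.

```python
from difflib import SequenceMatcher


def line_diff(parent: str, child: str):
    """Return (added, deleted) line lists for one parent -> child edit."""
    a, b = parent.splitlines(), child.splitlines()
    added, deleted = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(a[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(b[j1:j2])
    return added, deleted


def reintroduction_rate(lineage):
    """Fraction of added lines that exactly match a line deleted at an
    earlier step of the same lineage (hypothetical definition)."""
    seen_deleted = set()
    total_added = reintroduced = 0
    for parent, child in zip(lineage, lineage[1:]):
        added, deleted = line_diff(parent, child)
        for line in added:
            if not line.strip():
                continue  # assumption: blank lines are not counted
            total_added += 1
            if line in seen_deleted:
                reintroduced += 1
        # Update after counting so that a line deleted in this step
        # cannot match an addition made in the same step.
        seen_deleted.update(deleted)
    return reintroduced / total_added if total_added else 0.0
```

Calling `reintroduction_rate` on a three-program lineage in which a constant is changed and then changed back reports that half of the added lines are re-introductions. Whether the paper measures this per lineage or per run is not stated in the text above, so the per-lineage scope here is a simplification.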

What carries the argument

Annotation of every code edit into one of nine recurring types using a validated LLM-as-judge pipeline applied to full evolutionary traces.

If this is right

  • Reported progress on coding benchmarks can reflect simple re-tunes or cycling instead of structural novelty.
  • Diagnostic evaluation must inspect edit distributions and search dynamics rather than final scores alone.
  • Controlled interventions that block line re-introductions can test whether performance depends on cycling.
  • The EvoTrace dataset supports more precise comparison of evolutionary coding methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cycling behavior may indicate that search stays within a narrow region of the model's prior knowledge.
  • Similar re-introduction loops could limit progress in other iterative LLM editing workflows.

Load-bearing premise

The nine edit types assigned by the LLM judge accurately reflect the mechanisms that produce score changes.

What would settle it

Human re-annotation of the same edits that assigns score gains to different edit types and finds no thirty-percent cycling rate.

Figures

Figures reproduced from arXiv: 2605.20086 by Bo Han, Max Zimmer, Nico Pelleriti, Sebastian Pokutta, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li.

Figure 1. A taxonomy of edits performed by evolutionary coding agents. Each panel shows a representative parent–child diff (added lines in green, deleted lines in red) drawn from EvoTrace runs and labeled with one of nine recurring categories: Bug fix, External dependency, Architectural change, Composition, Local refinement, Pruning, Refactor, Efficiency, and Hyperparameter tuning. The categories range from minimal …
Figure 2. EvoTrace and EvoReplay. EvoTrace records each evolutionary run as a structured object: programs, parent–child graph, prompts and context, scores, and evaluator metadata. EvoReplay reconstructs local search states from these traces and reruns controlled interventions, including same-prompt replay, Bayesian-optimization retuning, static analysis, cycling detection, ablation, repair, context substitution, a…
Figure 3. Program size and numeric-literal hyperparameter count over a run. Best-so-far program length (LOC, left) and numeric-literal count (right), each normalized by the run's seed value, plotted against normalized iteration. Solid line = cross-run median; shaded band = inter-quartile range; dashed gray line marks the seed value. Math runs (n=59) accumulate modest LOC and hp growth (median final ratios 1.33× and …
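
The two quantities in Figure 3 can be approximated for Python programs with the standard `ast` module. This is a rough sketch under stated assumptions, not the paper's measurement code: the helper names are hypothetical, LOC is taken to mean non-blank lines, and every int or float constant is counted as a numeric literal (booleans excluded).

```python
import ast


def loc(source: str) -> int:
    """Count non-blank lines (assumed LOC definition)."""
    return sum(1 for line in source.splitlines() if line.strip())


def numeric_literal_count(source: str) -> int:
    """Count int/float constants in the parsed program, excluding booleans."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
        for node in ast.walk(tree)
    )


def _ratio(value: int, seed_value: int) -> float:
    # A seed with zero literals or zero lines has no meaningful ratio.
    return value / seed_value if seed_value else float("nan")


def ratios_vs_seed(seed: str, program: str):
    """Return (LOC ratio, numeric-literal ratio) of a program against its seed."""
    return (
        _ratio(loc(program), loc(seed)),
        _ratio(numeric_literal_count(program), numeric_literal_count(seed)),
    )
```

A ratio above 1 means the program has grown relative to the seed, which is how the dashed seed line in the figure is read.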
Figure 4. Edit taxonomy: frequency vs. per-edit utility across all programs in EvoTrace. (a) Frequency of each label: Hyperparameter tuning dominates the search distribution. (b) Per-edit odds ratio for positive normalized score change: External dependency, Efficiency, and Architectural change are the most helpful categories on a per-edit basis. The categories that most often improve a single edit are not the categ…
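
For readers unfamiliar with the statistic in panel (b), a per-label odds ratio can be computed from a 2×2 table of "edit carries the label" against "score change is positive". The sketch below is an assumption about the general shape of such a calculation, not the paper's estimator: `label_odds_ratio` is a hypothetical name, each edit is assumed to be a pair of a label set and a normalized score change, and the 0.5 continuity correction (Haldane-Anscombe) is added here only to avoid division by zero.

```python
def label_odds_ratio(edits, label, correction=0.5):
    """Odds of a positive score change for edits with `label`,
    divided by the same odds for edits without it.

    edits: iterable of (labels, delta) pairs, where labels is a set of
    taxonomy labels and delta is the normalized score change.
    """
    a = b = c = d = 0  # with+improved, with+not, without+improved, without+not
    for labels, delta in edits:
        improved = delta > 0
        if label in labels:
            if improved:
                a += 1
            else:
                b += 1
        else:
            if improved:
                c += 1
            else:
                d += 1
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)
```

Because edits can carry several labels, the same edit contributes to the "with" cell of each label it carries. A value above 1 indicates that edits with the label improve the score more often than edits without it.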
Figure 5. Best-so-far enrichment of edit labels (aggregate). Enrichment of each taxonomy label among best-so-far updates relative to the all-edits base rate. The categories most overrepresented on successful intermediate steps (Efficiency, External dependency, Hyperparameter tuning, Composition) are not identical to the most frequent labels in …
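
Enrichment, as described in this caption, compares how often a label appears among best-so-far updates with how often it appears among all edits. A minimal sketch follows, assuming each edit is a pair of a label set and a boolean flag marking a best-so-far update; the function name `enrichment` and this input shape are illustrative choices, not taken from the paper.

```python
def enrichment(edits, label):
    """Share of best-so-far updates carrying `label`, divided by the
    share of all edits carrying it.

    edits: list of (labels, is_best_so_far_update) pairs.
    Returns NaN when either rate is undefined or the base rate is zero.
    """
    if not edits:
        return float("nan")
    base_rate = sum(label in labels for labels, _ in edits) / len(edits)
    best = [labels for labels, is_best in edits if is_best]
    if not best or base_rate == 0:
        return float("nan")
    best_rate = sum(label in labels for labels in best) / len(best)
    return best_rate / base_rate
```

A ratio of 1 means the label is as common on successful steps as anywhere else; values above 1 indicate overrepresentation.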
Figure 6. Final-best-lineage enrichment (aggregate, robustness check). Enrichment of each label along the lineage from each run's final best program back to the seed. Efficiency, Hyperparameter tuning, and Composition remain overrepresented relative to the all-edits base rate, supporting the best-so-far view in …
Figure 7. Edit-label prevalence by domain. Frequency of each taxonomy label among labeled edits, split by domain. Hyperparameter tuning dominates in both domains, but Composition is more prominent on math while structural categories shift their relative weights between domains. B.4 LLM-as-judge validation. Taxonomy origin. The 9-category edit taxonomy used in §5.1 was derived inductively from EvoTrace runs rather tha…
Figure 8. Per-edit helpfulness (odds ratio for positive normalized score change) by domain. On ALE, External dependency, Efficiency, and Architectural change are the strongest positive categories. On math, External dependency is even stronger and Composition plays a larger role than on ALE.
Figure 9. Best-so-far enrichment of edit labels by domain. Enrichment of each label among best-so-far updates relative to the all-edits base rate, split by domain. The qualitative signal, a small set of categories (notably Efficiency, External dependency, and Hyperparameter tuning) overrepresented on successful intermediate steps, is consistent with the aggregate view in …
Figure 10. Final-best-lineage enrichment by domain. Robustness check for …
Figure 11. Distribution of labels per edit, by domain. Most edits in both domains are multi-label: 52.4% of edits aggregate-wide carry exactly two labels and only 32.4% are single-label. The categories of …
Figure 12. Per-edit helpfulness by backend. Odds ratio for positive normalized score change broken down by the four evolutionary backends. Some categories (notably External dependency) are consistently positive across backends, while others vary in magnitude. openevolve_native contributes only 2 runs to this corpus, so its column should be interpreted with a wider implicit confidence band; we include it for complete…
Original abstract

Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model's internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process itself, not only its final outcome. We introduce EvoTrace, a dataset of evolutionary coding traces spanning four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks across mathematics and algorithm design. To analyze these traces, we develop EvoReplay, a replay-based methodology that reconstructs the local search states behind high-scoring solutions and tests controlled interventions, including adjusting constants, removing program components and substituting models or prompting contexts. We annotate every code edit in EvoTrace with one of nine recurring edit types using an LLM-as-judge pipeline validated against blind human re-annotation. Across EvoTrace, most score gains come from a small subset of these edit types. We further find a deterministic cycling pattern: about 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. These results show that benchmark gains in evolutionary coding agents can arise from qualitatively different mechanisms, only some of which correspond to new algorithmic structure. EvoTrace enables more diagnostic evaluation of evolutionary coding agents beyond final benchmark scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoTrace, a dataset of evolutionary coding traces spanning four frameworks, reasoning and non-reasoning models, and 16 tasks in mathematics and algorithm design. It develops EvoReplay, a replay-based method to reconstruct local search states and perform controlled interventions (e.g., adjusting constants, removing components, substituting models or prompts). All code edits are annotated into one of nine recurring types via an LLM-as-judge pipeline validated by blind human re-annotation. Findings include that most score gains derive from a small subset of edit types and a deterministic cycling pattern in which ~30% of added lines are byte-identical re-introductions of previously deleted lines. The central claim is that benchmark gains arise from qualitatively different mechanisms, only some of which reflect new algorithmic structure.

Significance. If the mechanism distinctions hold, the work would meaningfully advance evaluation practices in evolutionary computation and LLM-guided search by shifting focus from final scores to process diagnostics. The dataset and replay methodology could enable more reproducible and targeted analysis of whether gains reflect genuine innovation versus re-tuning or overfitting. Concrete quantitative observations such as the cycling rate provide falsifiable anchors for follow-up studies.

major comments (2)
  1. [§4.2] §4.2 (LLM-as-Judge Pipeline and Validation): The manuscript reports validation against blind human re-annotation but provides no per-type agreement rates, no quantification of agreement specifically on edits labeled as 'new algorithmic structure', and no sensitivity analysis showing how re-labeling of borderline cases affects the claim that most gains come from a small subset of types. Because this classification step is required to map score improvements to the four distinct mechanisms listed in the abstract, the absence of these metrics leaves the qualitative distinction under-supported.
  2. [§3.1] §3.1 (EvoReplay Interventions): The controlled interventions are introduced to test mechanisms behind high-scoring solutions, yet the text does not detail how each intervention (constant adjustment, component removal, model substitution) isolates 'new algorithmic structure' from re-tuning, recombination, or evaluator overfitting. Without explicit mapping or controls for confounding factors, it is unclear whether the interventions confirm the claimed separation of mechanisms.
minor comments (2)
  1. [Abstract] The abstract states that traces come from 'four evolutionary frameworks' without naming them; adding the specific names would improve immediate clarity for readers.
  2. [Figures] Figure captions for edit-type distributions should explicitly list the nine types and their definitions to allow readers to interpret the 'small subset' result without cross-referencing the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional detail can strengthen the presentation of our validation and intervention methodology. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (LLM-as-Judge Pipeline and Validation): The manuscript reports validation against blind human re-annotation but provides no per-type agreement rates, no quantification of agreement specifically on edits labeled as 'new algorithmic structure', and no sensitivity analysis showing how re-labeling of borderline cases affects the claim that most gains come from a small subset of types. Because this classification step is required to map score improvements to the four distinct mechanisms listed in the abstract, the absence of these metrics leaves the qualitative distinction under-supported.

    Authors: We agree that per-type agreement rates and a sensitivity analysis would provide stronger support for the classification step. In the revised manuscript we will add a table in §4.2 reporting agreement (Cohen’s kappa and raw percentage) for each of the nine edit types, with a separate row for the ‘new algorithmic structure’ category. We will also include a sensitivity analysis that re-labels borderline cases according to the human annotators’ secondary choices and shows that the result—most score gains arising from a small subset of types—remains stable. revision: yes

  2. Referee: [§3.1] §3.1 (EvoReplay Interventions): The controlled interventions are introduced to test mechanisms behind high-scoring solutions, yet the text does not detail how each intervention (constant adjustment, component removal, model substitution) isolates 'new algorithmic structure' from re-tuning, recombination, or evaluator overfitting. Without explicit mapping or controls for confounding factors, it is unclear whether the interventions confirm the claimed separation of mechanisms.

    Authors: We accept that the current text does not make the mapping between interventions and mechanisms fully explicit. In the revision we will expand §3.1 with a table that directly links each intervention to the mechanism(s) it is intended to isolate (constant adjustment for re-tuning, component removal for new structure versus recombination, model/prompt substitution for internal knowledge versus search-derived structure). We will also describe the controls already present in the replay protocol—held-out test cases and fixed evaluator seeds—to address potential evaluator overfitting. revision: yes
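
The per-type agreement table promised in the first response could be built by treating each edit type as a binary "present or absent" decision and computing Cohen's kappa between the LLM judge and the human annotator. The sketch below illustrates that one-vs-rest reading; it is an assumption about how such a table might be produced, and the names `cohen_kappa` and `per_type_kappa`, along with the list-of-label-sets input, are hypothetical.

```python
def cohen_kappa(x, y):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    categories = set(x) | set(y)
    expected = sum((x.count(c) / n) * (y.count(c) / n) for c in categories)
    if expected == 1.0:
        return float("nan")  # both raters used a single category throughout
    return (observed - expected) / (1.0 - expected)


def per_type_kappa(judge, human, edit_types):
    """One-vs-rest kappa per edit type.

    judge, human: lists of label sets, one set per edit, in the same order.
    """
    return {
        t: cohen_kappa([t in j for j in judge], [t in h for h in human])
        for t in edit_types
    }
```

Reporting raw percentage agreement next to kappa, as the authors propose, matters for rare labels: raw agreement can be high simply because both raters almost always mark the label as absent, while kappa corrects for that chance agreement.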

Circularity Check

0 steps flagged

Empirical trace analysis is self-contained with no circular derivation

Full rationale

The paper's central results rest on direct inspection of evolutionary search traces via the introduced EvoTrace dataset and EvoReplay interventions, plus LLM-as-judge annotation of edit types validated by blind human re-annotation. These are observational findings about the distribution of score gains and byte-identical line re-introductions; no mathematical derivation, fitted parameter renamed as prediction, or load-bearing self-citation chain is present that would reduce the claimed distinctions among mechanisms to the inputs by construction. The analysis is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard domain assumptions about the meaningfulness of task-specific evaluators in evolutionary search but introduces no free parameters, new axioms beyond those, or invented entities; the contribution lies in the empirical analysis framework and dataset.

Axioms (1)
  • Domain assumption: Task-specific evaluators provide meaningful feedback capable of distinguishing different mechanisms of improvement.
    The entire diagnostic analysis depends on the assumption that the evaluators used during evolutionary search are reliable indicators of progress.

pith-pipeline@v0.9.0 · 5843 in / 1265 out tokens · 48300 ms · 2026-05-20T03:40:13.380664+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 20 internal anchors

  1. [1]

    Pawan Kumar, Emilien Dupont, Francisco J

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nature, 625(7995):468–475, January 2024. ISSN 0028-0836, 1476-...

  2. [2]

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  3. [3]

    Mathematical exploration and discovery at scale

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, December 2025. https://arxiv.org/abs/2511.02864

  4. [4]

    ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution, September 2025

    Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution, September 2025. https://arxiv.org/abs/2509. 19349

  5. [5]

    CodeEvolve: An open source evolutionary coding agent for algorithmic discovery and optimization, March 2026

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. CodeEvolve: An open source evolutionary coding agent for algorithmic discovery and optimization, March 2026. https://arxiv.org/abs/2510.14150

  6. [6]

    AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization, February 2026

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization, February 2026. https: //arxiv.org/abs/2602.20133

  7. [7]

    Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G

    Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, and Ion Stoica. EvoX: Meta- Evolution for Automated Discovery, March 2026.https://arxiv.org/abs/2602.23413

  8. [8]

    Let the Barbarians In: How AI Can Accelerate Systems Performance Research, December 2025.https://arxiv.org/abs/2512.14806

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Shubham Agarwal, Mert Cemri, Bowen Wang, Alexander Krentsel, Tian Xia, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Ashwin Naren, Shulu Li, Ruiying Ma, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. Let the Barbarians In: How AI Can Accelerate Systems Performance Research, De...

  9. [9]

    Evo- Engineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models, October 2025.https://arxiv.org/abs/2510.03760

    Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, and Qingfu Zhang. Evo- Engineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models, October 2025.https://arxiv.org/abs/2510.03760

  10. [10]

    Gonzalez, and Ion Stoica

    Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, and Ion Stoica. K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model, February 2026. https://arxiv.org/abs/2602. 19128

  11. [11]

    KernelFoundry: Hardware-aware evolutionary GPU kernel optimization, March 2026

    Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummen- hofer. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization, March 2026. https://arxiv.org/abs/2603.12440

  12. [12]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https://github.com/algorithmicsuperintelligence/openevolve

  13. [13]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl- Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, February 2026. htt...

  14. [14]

    GigaEvo: An Open Source Op- timization Framework Powered By LLMs And Evolution Algorithms, November 2025

    Valentin Khrulkov, Andrey Galichin, Denis Bashkirov, Dmitry Vinichenko, Oleg Travkin, Roman Alferov, Andrey Kuznetsov, and Ivan Oseledets. GigaEvo: An Open Source Op- timization Framework Powered By LLMs And Evolution Algorithms, November 2025. https://arxiv.org/abs/2511.17592

  15. [15]

    The FM Agent, February 2026.https://arxiv.org/abs/2510.26144

    Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, and Dou Shen. The FM Agent, February 2026.https://arxiv.org/abs/2510.26144

  16. [16]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code, February 2025. https: //arxiv.org/abs/2502.13138

  17. [17]

    Chi, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng, and Beidou Wang

    Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng, and Beidou Wang. PACEvolve: Enabling Long- Horizon Progress-Aware Consistent Evolution, January 2026. https://arxiv.org/abs/ 2601.10657

  18. [18]

    AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

    Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, and Emad Barsoum. AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection, February 2026.https://arxiv.org/abs/2602.11931

  19. [19]

    CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad, March 2026

    Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, and Kun Zhang. CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad, March 2026. https: //arxiv.org/abs/2603.14575

  20. [20]

    \(X\)-evolve: Solution space evolution powered by large language models, August 2025.https://arxiv.org/abs/2508.07932

    Yi Zhai, Zhiqiang Wei, Ruohan Li, Keyu Pan, Shuo Liu, Lu Zhang, Jianmin Ji, Wuyang Zhang, Yu Zhang, and Yanyong Zhang. \(X\)-evolve: Solution space evolution powered by large language models, August 2025.https://arxiv.org/abs/2508.07932

  21. [21]

    SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution

    Sichun Luo, Yi Huang, Haochen Luo, Fengyuan Liu, Guanzhi Deng, Lei Li, Qinghua Yao, Zefa Hu, Junlan Feng, and Qi Liu. SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution, April 2026.https://arxiv.org/abs/2604.24372

  22. [22]

    Meta Context Engineer- ing via Agentic Skill Evolution, February 2026.https://arxiv.org/abs/2601.21557

    Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta Context Engineer- ing via Agentic Skill Evolution, February 2026.https://arxiv.org/abs/2601.21557

  23. [23]

    C- Evolve: Consensus-based Evolution for Prompt Groups, September 2025

    Tiancheng Li, Yuhang Wang, Zhiyang Chen, Zijun Wang, Liyuan Ma, and Guo-jun Qi. C- Evolve: Consensus-based Evolution for Prompt Groups, September 2025. https://arxiv. org/abs/2509.23331

  24. [24]

    Population-Evolve: A Parallel Sampling and Evolutionary Method for LLM Math Reasoning, December 2025

    Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, and Shuxin Zheng. Population-Evolve: A Parallel Sampling and Evolutionary Method for LLM Math Reasoning, December 2025. https://arxiv.org/abs/2512.19081

  25. [25]

    Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery, February 2026

    Timothee Leleu, Sudeera Gunathilaka, Federico Ghimenti, and Surya Ganguli. Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery, February 2026. https:// arxiv.org/abs/2602.03132

  26. [26]

    Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI, March 2026

    Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer. Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI, March 2026. https: //arxiv.org/abs/2507.14172

  27. [27]

    ThetaEvolve: Test-time Learning on Open Problems

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. ThetaEvolve: Test-time Learning on Open Problems, November 2025.https://arxiv.org/abs/2511.23473

  28. [28]

    Learning to Discover at Test Time

    Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to Discover at Test Time, February 2026.https://arxiv.org/abs/2601.16175. 12

  29. [29]

    Ruiying Ma, Chieh-Jan Mike Liang, Yanjie Gao, and Francis Y . Yan. MetaMuse: Algorithm Generation via Creative Ideation, October 2025.https://arxiv.org/abs/2510.03851

  30. [30]

    LLM Priors for ERM over Programs, February 2026.https://arxiv.org/abs/2510.14331

    Shivam Singhal, Priyadarsi Mishra, Eran Malach, and Tomer Galanti. LLM Priors for ERM over Programs, February 2026.https://arxiv.org/abs/2510.14331

  31. [31]

    Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

    Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain, 2026.https://arxiv.org/abs/2603.02218

  32. [32]

    AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, and Sean Welleck. Ada- Explore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation, April 2026.https://arxiv.org/abs/2604.16625

  33. [33]

    Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

    Daniel Nichols, Konstantinos Parasyris, Caetano Melone, Tal Ben-Nun, Giorgis Georgakoudis, and Harshitha Menon. Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search, April 2026.https://arxiv.org/abs/2604.11109

  34. [34]

    Astra: A Multi-Agent System for GPU Kernel Performance Optimization, December 2025.https://arxiv.org/abs/2509.07506

    Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A Multi-Agent System for GPU Kernel Performance Optimization, December 2025.https://arxiv.org/abs/2509.07506

  35. [35]

    ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization, February 2026.https://arxiv.org/abs/2602.02597

    Hongyuan Su, Yu Zheng, and Yong Li. ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization, February 2026.https://arxiv.org/abs/2602.02597

  36. [36]

    Barbarians at the Gate: How AI is Upending Systems Research, October 2025.https://arxiv.org/abs/2510.06189

    Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. Barbarians at the Gate: How AI is Upending Systems Research, October 2025.https://arxiv.org/abs/2510.06189

  37. [37]

    Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve, January 2026

    Hongzheng Chen, Alexander Novikov, Ngân V˜u, Hanna Alam, Zhiru Zhang, Aiden Grossman, Mircea Trofin, and Amir Yazdanbakhsh. Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve, January 2026. https://arxiv.org/abs/2601. 21096

  38. [38]

    ArchAgent: Agentic AI-driven Computer Architecture Discovery, February 2026.https://arxiv.org/abs/2602.22425

    Raghav Gupta, Akanksha Jain, Abraham Gonzalez, Alexander Novikov, Po-Sen Huang, Matej Balog, Marvin Eisenberger, Sergey Shirobokov, Ngân V ˜u, Martin Dixon, Borivoje Nikoli ´c, Parthasarathy Ranganathan, and Sagar Karandikar. ArchAgent: Agentic AI-driven Computer Architecture Discovery, February 2026.https://arxiv.org/abs/2602.22425

  39. [39]

    MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models, February 2026

    Tianyi Li, Shihui Zang, and Moritz Münchmeyer. MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models, February 2026. https://arxiv. org/abs/2602.15951

  40. [40]

    Beyond Algorithm Evolution: An LLM-Driven Framework for the Co-Evolution of Swarm Intelligence Optimization Algorithms and Prompts, December 2025

    Shipeng Cen and Ying Tan. Beyond Algorithm Evolution: An LLM-Driven Framework for the Co-Evolution of Swarm Intelligence Optimization Algorithms and Prompts, December 2025. https://arxiv.org/abs/2512.09209

  41. [41]

    Iterated Agent for Symbolic Regression, October 2025.https://arxiv.org/abs/2510.08317

    Zhuo-Yang Song, Zeyu Cai, Shutao Zhang, Jiashen Wei, Jichen Pan, Shi Qiu, Qing-Hong Cao, Tie-Jiun Hou, Xiaohui Liu, Ming-xing Luo, and Hua Xing Zhu. Iterated Agent for Symbolic Regression, October 2025.https://arxiv.org/abs/2510.08317

  42. [42]

    RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution, February 2026

    Jinming Nian, Fangchen Li, Dae Hoon Park, and Yi Fang. RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution, February 2026. https:// arxiv.org/abs/2602.16932

  43. [43]

    Haochen Wang, Yi Wu, Daryl Chang, Li Wei, and Lukasz Heldt. Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents, February 2026. https://arxiv.org/abs/2602.10226

  44. [44]

    Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, and Arber Zela. Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch, April 2026. https://arxiv.org/abs/2603.24647

  45. [45]

    Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, and Yi Xu. Controlled Self-Evolution for Algorithmic Code Optimization, February 2026. https://arxiv.org/abs/2601.07348

  46. [46]

    Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, and Bo Han. AlphaApollo: A System for Deep Agentic Reasoning, March 2026. https://arxiv.org/abs/2510.06261

  47. [47]

    Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian. R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, October 2025. https://arxiv.org/abs/2505.14738

  48. [48]

    Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, February 2026. https://arxiv.org/abs/2602.04837

  49. [49]

    Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ramchandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. SkyDiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026. URL...

  50. [50]

    John R. Koza. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2), June 1994. ISSN 0960-3174, 1573-1375. doi: 10.1007/BF00175355

  51. [51]

    Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population Based Training of Neural Networks, November 2017. https://arxiv.org/abs/1711.09846

  52. [52]

    Chao Qian, Ke Xue, and Ren-Jian Wang. Quality-Diversity Algorithms Can Provably Be Helpful for Optimization, May 2024. https://arxiv.org/abs/2401.10539

  53. [53]

    Giorgia Nadizar, Francesco Rusin, Eric Medvet, and Gabriela Ochoa. The Role of Stepping Stones in MAP-Elites: Insights from Search Trajectory Networks. In Bing Xue, Luca Manzoni, and Illya Bakurov, editors, Genetic Programming, volume 15609, pages 224–239. Springer Nature Switzerland, Cham, 2025. ISBN 978-3-031-89990-4 978-3-031-89991-1. doi: 10.1007/978-...

  54. [54]

    Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktäschel. Open-Endedness is Essential for Artificial Superhuman Intelligence, June 2024. https://arxiv.org/abs/2406.04268

  55. [55]

    Dan Friedman and Adji Bousso Dieng. The Vendi Score: A Diversity Evaluation Metric for Machine Learning, July 2023. https://arxiv.org/abs/2210.02410

  56. [56]

    Rui Zhang and Zhichao Lu. Rethinking Code Similarity for Automated Algorithm Design with LLMs, March 2026. https://arxiv.org/abs/2603.02787

  57. [57]

    Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems, August 2025. https://arxiv.org/abs/2508.07407

  58. [58]

    Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, and Yue Zhang. How Far Are AI Scientists from Changing the World?, August 2025. https://arxiv.org/abs/2507.23276

  59. [59]

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, February 2025. https://arxiv.org/abs/2410.07095

  60. [60]

    Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering, October 2025. https://arxiv.org/abs/2506.09050

  61. [61]

    Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Ch...

  62. [62]

    Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, and Yuzhi Xu. Evaluation-driven Scaling for Scientific Discovery,...

  63. [63]

    Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A. Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, and...

  64. [64]

    Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, and Ningyu Zhang. Can We Predict Before Executing Machine Learning Agents?, January 2026. https://arxiv.org/abs/2601.05930

  65. [65]

    Yonatan Gideoni, Sebastian Risi, and Yarin Gal. Simple Baselines are Competitive with Code Evolution, February 2026. https://arxiv.org/abs/2602.16805

  66. [66]

    Xinhao Zhang, Xi Chen, François Portet, and Maxime Peyrard. What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search, April 2026. https://arxiv.org/abs/2604.19440

  67. [67]

    Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search, August 2025. https://arxiv.org/abs/2504.19636

  68. [68]

    Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, and Ching-An Cheng. Understanding the Challenges in Iterative Generative Optimization with LLMs, March 2026. https://arxiv.org/abs/2603.23994

  69. [69]

    Lan Pan, Hanbo Xie, and Robert C. Wilson. Large Language Models Think Too Fast To Explore Effectively, May 2025. https://arxiv.org/abs/2501.18009

  70. [70]

    Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond), October 2025. https://arxiv.org/abs/2510.22954

  71. [71]

    Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, and Jing Shao. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents, March 2026. https://arxiv.org/abs/2509.26354

  72. [72]

    Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why Do Multi-Agent LLM Systems Fail?, October 2025. https://arxiv.org/abs/2503.13657

A Additional EvoTrace Details

A.1 Per-field trace schema

Evo...