What Do Evolutionary Coding Agents Evolve?

arXiv: 2605.20086 · v1 · pith:47Q5SZXC · submitted 2026-05-19 · cs.NE · cs.AI · cs.LG

Nico Pelleriti, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li, Max Zimmer, Bo Han, Sebastian Pokutta

Pith reviewed 2026-05-20 03:40 UTC · model grok-4.3

keywords: evolutionary search · LLM code generation · edit classification · search traces · algorithm design · benchmark evaluation · code evolution

The pith

Evolutionary coding agents often improve scores by cycling deleted lines back into code rather than inventing new algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Systems that combine LLMs with evolutionary search generate and refine code for math and algorithm tasks. Final benchmark scores can result from new structure, re-tuning, recombination of existing knowledge, or overfitting to the judge. The authors create the EvoTrace dataset of full search traces from four frameworks and sixteen tasks, then apply EvoReplay to reconstruct states and test interventions on successful solutions. Every edit receives one of nine labels from an LLM judge validated by blind human review. Most gains trace to a small subset of edit types, and about thirty percent of added lines are exact re-introductions of lines deleted earlier in the same run.

Core claim

Benchmark gains in evolutionary coding agents arise from qualitatively different mechanisms, only some of which introduce new algorithmic structure; a deterministic cycling pattern appears in which roughly thirty percent of lines added during search are byte-identical re-introductions of previously deleted lines.
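
The cycling measurement can be illustrated with a small sketch. The paper's exact procedure is not reproduced on this page, so the code below is one plausible reading under stated assumptions: programs are compared line by line along a single parent-to-child lineage, and an added line counts as a re-introduction if it exactly matches a line deleted at an earlier step. The function name `cycling_rate` and the input format are hypothetical.

```python
import difflib


def cycling_rate(versions):
    """Fraction of added lines that exactly match a line deleted earlier.

    `versions` is a list of program texts ordered parent -> child along one
    lineage. This input format is an assumption, not the EvoTrace schema.
    """
    deleted_so_far = set()
    added_total = 0
    reintroduced = 0
    for parent, child in zip(versions, versions[1:]):
        parent_lines = parent.splitlines()
        child_lines = child.splitlines()
        matcher = difflib.SequenceMatcher(a=parent_lines, b=child_lines, autojunk=False)
        newly_deleted = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):
                newly_deleted.extend(parent_lines[i1:i2])
            if tag in ("replace", "insert"):
                for line in child_lines[j1:j2]:
                    if not line.strip():
                        continue  # blank lines are ignored (an assumption)
                    added_total += 1
                    if line in deleted_so_far:
                        reintroduced += 1
        # Lines deleted in this step only count as "previously deleted"
        # from the next step onward.
        deleted_so_far.update(newly_deleted)
    return reintroduced / added_total if added_total else 0.0


# Toy lineage: `b = 2` is deleted in step 1 and comes back in step 2.
toy = ["a = 1\nb = 2\n", "a = 1\nb = 3\n", "a = 1\nb = 2\n"]
print(cycling_rate(toy))
```

In the toy lineage two lines are added in total and one of them restores a deleted line, so the rate is 0.5. A real analysis would also need to decide how to treat branching lineages and whitespace-only changes.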

What carries the argument

Annotation of every code edit into one of nine recurring types using a validated LLM-as-judge pipeline applied to full evolutionary traces.

If this is right

Reported progress on coding benchmarks can reflect simple re-tunes or cycling instead of structural novelty.
Diagnostic evaluation must inspect edit distributions and search dynamics rather than final scores alone.
Controlled interventions that block line re-introductions can test whether performance depends on cycling.
The EvoTrace dataset supports more precise comparison of evolutionary coding methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cycling behavior may indicate that search stays within a narrow region of the model's prior knowledge.
Similar re-introduction loops could limit progress in other iterative LLM editing workflows.

Load-bearing premise

The nine edit types assigned by the LLM judge accurately reflect the mechanisms that produce score changes.

What would settle it

Human re-annotation of the same edits that assigns score gains to different edit types and finds no thirty-percent cycling rate.

Figures

Figures reproduced from arXiv: 2605.20086 by Bo Han, Max Zimmer, Nico Pelleriti, Sebastian Pokutta, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li.

**Figure 1.** A taxonomy of edits performed by evolutionary coding agents. Each panel shows a representative parent–child diff (added lines in green, deleted lines in red) drawn from EvoTrace runs and labeled with one of nine recurring categories: Bug fix, External dependency, Architectural change, Composition, Local refinement, Pruning, Refactor, Efficiency, and Hyperparameter tuning. The categories range from minimal …

**Figure 2.** EvoTrace and EvoReplay. EvoTrace records each evolutionary run as a structured object: programs, parent–child graph, prompts and context, scores, and evaluator metadata. EvoReplay reconstructs local search states from these traces and reruns controlled interventions, including same-prompt replay, Bayesian-optimization retuning, static analysis, cycling detection, ablation, repair, context substitution, a…
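
To make the "structured object" in Figure 2 concrete, here is a minimal sketch of a trace record and two queries over the parent–child graph. The field names (`program_id`, `parent_id`, `iteration`, and so on) are hypothetical and do not reflect the actual EvoTrace schema; the sketch also assumes that higher scores are better.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProgramRecord:
    # Hypothetical fields; the real EvoTrace schema may differ.
    program_id: str
    parent_id: Optional[str]  # None for the seed program
    code: str
    score: float
    iteration: int


def lineage_to_seed(records: Dict[str, ProgramRecord], program_id: str) -> List[ProgramRecord]:
    """Follow parent links from `program_id` back to the seed (seed-first order)."""
    chain = []
    seen = set()
    current = program_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in parent links")
        seen.add(current)
        record = records[current]
        chain.append(record)
        current = record.parent_id
    chain.reverse()
    return chain


def best_so_far_updates(records: Dict[str, ProgramRecord]) -> List[ProgramRecord]:
    """Programs that strictly beat every score seen at earlier iterations."""
    best = float("-inf")
    updates = []
    for record in sorted(records.values(), key=lambda r: r.iteration):
        if record.score > best:
            best = record.score
            updates.append(record)
    return updates


records = {
    "s": ProgramRecord("s", None, "", 0.1, 0),
    "a": ProgramRecord("a", "s", "", 0.5, 1),
    "b": ProgramRecord("b", "s", "", 0.3, 2),
    "c": ProgramRecord("c", "a", "", 0.9, 3),
}
print([r.program_id for r in lineage_to_seed(records, "c")])
print([r.program_id for r in best_so_far_updates(records)])
```

Note that this sketch counts the seed itself as the first best-so-far update; an analysis of edits would likely exclude it, since the seed has no parent diff.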

**Figure 3.** Program size and numeric-literal hyperparameter count over a run. Best-so-far program length (LOC, left) and numeric-literal count (right), each normalized by the run’s seed value, plotted against normalized iteration. Solid line = cross-run median; shaded band = inter-quartile range; dashed gray line marks the seed value. Math runs (n=59) accumulate modest LOC and hp growth (median final ratios 1.33× and …
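
The two quantities plotted in Figure 3 can be approximated for Python programs with the standard `ast` module. This is a sketch under assumptions: the page does not state how the paper counts lines or numeric literals, so here LOC means non-blank lines (comments included) and a numeric literal is any int, float, or complex constant in the syntax tree, excluding booleans. All function names are hypothetical.

```python
import ast


def count_loc(source: str) -> int:
    """Non-blank lines; comment lines are counted (an assumption)."""
    return sum(1 for line in source.splitlines() if line.strip())


def count_numeric_literals(source: str) -> int:
    """Number of int/float/complex constants in the parsed program."""
    tree = ast.parse(source)
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float, complex))
        and not isinstance(node.value, bool)  # True/False are ints in Python
    )


def safe_ratio(current: float, seed: float) -> float:
    """Ratio to the seed value, or NaN when the seed value is zero."""
    return current / seed if seed else float("nan")


def ratios_vs_seed(seed_source: str, current_source: str):
    """(LOC ratio, numeric-literal ratio) of a program relative to its seed."""
    return (
        safe_ratio(count_loc(current_source), count_loc(seed_source)),
        safe_ratio(count_numeric_literals(current_source), count_numeric_literals(seed_source)),
    )


seed = "x = 1\ny = 2.5\n"
current = "x = 1\ny = 2.5\nz = x * 3 + 0.1\n\nflag = True\n"
print(ratios_vs_seed(seed, current))
```

In the toy example both ratios come out to 2.0: the program doubles from two to four non-blank lines and from two to four numeric literals, while `True` is not counted.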

**Figure 4.** Edit-taxonomy: frequency vs. per-edit utility across all programs in EvoTrace. (a) Frequency of each label: Hyperparameter tuning dominates the search distribution. (b) Per-edit odds ratio for positive normalized score change: External dependency, Efficiency, and Architectural change are the most helpful categories on a per-edit basis. The categories that most often improve a single edit are not the categ…
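
A per-label odds ratio like the one in panel (b) of Figure 4 can be computed from a 2×2 table: edits with or without the label, crossed with whether the score improved. The sketch below is an assumption about the construction, not the paper's documented estimator. In particular, the Haldane–Anscombe correction (adding 0.5 to each cell) is a common choice introduced here for illustration, and the function name and input format are hypothetical.

```python
def label_odds_ratio(edits, label, correction=0.5):
    """Odds of a positive score change for edits with `label` vs. without it.

    `edits` is an iterable of (labels, score_delta) pairs, where `labels` is
    a set of taxonomy labels (edits may carry several). `correction` is added
    to every cell of the 2x2 table; with correction=0 an empty cell can
    raise ZeroDivisionError.
    """
    with_pos = with_nonpos = without_pos = without_nonpos = 0
    for labels, delta in edits:
        improved = delta > 0
        if label in labels:
            if improved:
                with_pos += 1
            else:
                with_nonpos += 1
        else:
            if improved:
                without_pos += 1
            else:
                without_nonpos += 1
    a, b, c, d = (
        x + correction for x in (with_pos, with_nonpos, without_pos, without_nonpos)
    )
    return (a * d) / (b * c)


toy_edits = (
    [({"Efficiency"}, 0.2)] * 3
    + [({"Efficiency"}, -0.1)]
    + [({"Pruning"}, 0.1)]
    + [({"Pruning"}, -0.1)] * 3
)
print(label_odds_ratio(toy_edits, "Efficiency", correction=0.0))
```

In the toy data, Efficiency edits improve the score three times out of four and all other edits one time out of four, giving an uncorrected odds ratio of 9.0.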

**Figure 5.** Best-so-far enrichment of edit labels (aggregate). Enrichment of each taxonomy label among best-so-far updates relative to the all-edits base rate. The categories most overrepresented on successful intermediate steps (Efficiency, External dependency, Hyperparameter tuning, Composition) are not identical to the most frequent labels in …
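
The enrichment in Figure 5 can be read as a ratio of two shares: how often a label appears among best-so-far updates, divided by how often it appears among all edits. The following sketch implements that reading. The function name and the `(labels, is_best_so_far)` input format are hypothetical, and the paper may normalize differently.

```python
from collections import Counter


def label_enrichment(edits):
    """Enrichment of each label among best-so-far updates vs. all edits.

    `edits` is a list of (labels, is_best_so_far) pairs, where `labels` is a
    set of taxonomy labels. A value above 1 means the label is
    overrepresented among best-so-far updates.
    """
    all_counts = Counter()
    best_counts = Counter()
    n_all = 0
    n_best = 0
    for labels, is_best in edits:
        n_all += 1
        all_counts.update(labels)
        if is_best:
            n_best += 1
            best_counts.update(labels)
    if n_best == 0:
        return {}
    return {
        label: (best_counts[label] / n_best) / (all_counts[label] / n_all)
        for label in all_counts
    }


toy_edits = [
    ({"Hyperparameter tuning"}, False),
    ({"Hyperparameter tuning"}, False),
    ({"Hyperparameter tuning"}, True),
    ({"Efficiency"}, True),
]
print(label_enrichment(toy_edits))
```

The toy data mirror the qualitative point of the caption: Hyperparameter tuning is the most frequent label (three of four edits) yet is underrepresented among best-so-far updates, while the rarer Efficiency label is enriched.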

**Figure 6.** Final-best-lineage enrichment (aggregate, robustness check). Enrichment of each label along the lineage from each run’s final best program back to the seed. Efficiency, Hyperparameter tuning, and Composition remain overrepresented relative to the all-edits base rate, supporting the best-so-far view in …

**Figure 7.** Edit-label prevalence by domain. Frequency of each taxonomy label among labeled edits, split by domain. Hyperparameter tuning dominates in both domains, but Composition is more prominent on math while structural categories shift their relative weights between domains.

B.4 LLM-as-judge validation. Taxonomy origin. The 9-category edit taxonomy used in §5.1 was derived inductively from EvoTrace runs rather tha…

**Figure 8.** Per-edit helpfulness (odds ratio for positive normalized score change) by domain. On ALE, External dependency, Efficiency, and Architectural change are the strongest positive categories. On math, External dependency is even stronger and Composition plays a larger role than on ALE.

**Figure 9.** Best-so-far enrichment of edit labels by domain. Enrichment of each label among best-so-far updates relative to the all-edits base rate, split by domain. The qualitative signal, a small set of categories (notably Efficiency, External dependency, and Hyperparameter tuning) overrepresented on successful intermediate steps, is consistent with the aggregate view in …

**Figure 10.** Final-best-lineage enrichment by domain. Robustness check for …

**Figure 11.** Distribution of labels per edit, by domain. Most edits in both domains are multilabel: 52.4% of edits aggregate-wide carry exactly two labels and only 32.4% are single-label. The categories of …

**Figure 12.** Per-edit helpfulness by backend. Odds ratio for positive normalized score change broken down by the four evolutionary backends. Some categories (notably External dependency) are consistently positive across backends, while others vary in magnitude. openevolve_native contributes only 2 runs to this corpus, so its column should be interpreted with a wider implicit confidence band; we include it for complete…

Original abstract

Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model's internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process itself, not only its final outcome. We introduce EvoTrace, a dataset of evolutionary coding traces spanning four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks across mathematics and algorithm design. To analyze these traces, we develop EvoReplay, a replay-based methodology that reconstructs the local search states behind high-scoring solutions and tests controlled interventions, including adjusting constants, removing program components and substituting models or prompting contexts. We annotate every code edit in EvoTrace with one of nine recurring edit types using an LLM-as-judge pipeline validated against blind human re-annotation. Across EvoTrace, most score gains come from a small subset of these edit types. We further find a deterministic cycling pattern: about 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. These results show that benchmark gains in evolutionary coding agents can arise from qualitatively different mechanisms, only some of which correspond to new algorithmic structure. EvoTrace enables more diagnostic evaluation of evolutionary coding agents beyond final benchmark scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows evolutionary coding agents often gain scores by cycling the same code lines back in, and it supplies a dataset plus replay method to inspect the process instead of just final scores.

The main thing to know is that these agents frequently improve by re-introducing byte-identical lines that were deleted earlier: about 30 percent of all added lines, in nearly every run. That pattern suggests much of the benchmark movement reflects mechanical, repetitive editing rather than new algorithmic structure.

The work introduces EvoTrace, a collection of full traces from four frameworks, reasoning and non-reasoning models, and sixteen math and algorithm-design tasks. EvoReplay then reconstructs local search states and runs controlled interventions such as removing components or swapping models. The authors label every edit with one of nine types using an LLM judge that was checked against blind human re-annotation, and they report that most score gains trace to a small subset of those types. This moves the discussion from outcome scores to observable mechanisms in the search itself.

The cycling result and the dataset are the clearest additions. The classification step is reasonable but rests on the judge's ability to separate new structure from re-tuning or recombination. The abstract notes human validation, yet without per-type agreement numbers or a sensitivity check on borderline edits, it is hard to know how stable the mechanism distinctions are. If a few labels shift, the claim that only some gains reflect genuine novelty could weaken.

Readers working on LLM-based evolutionary search or on better evaluation protocols for code agents will find the traces and replay approach useful. The paper deserves a serious referee because the dataset and intervention method address a real gap, even if the edit taxonomy needs tighter validation in review.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoTrace, a dataset of evolutionary coding traces spanning four frameworks, reasoning and non-reasoning models, and 16 tasks in mathematics and algorithm design. It develops EvoReplay, a replay-based method to reconstruct local search states and perform controlled interventions (e.g., adjusting constants, removing components, substituting models or prompts). All code edits are annotated into one of nine recurring types via an LLM-as-judge pipeline validated by blind human re-annotation. Findings include that most score gains derive from a small subset of edit types and a deterministic cycling pattern in which ~30% of added lines are byte-identical re-introductions of previously deleted lines. The central claim is that benchmark gains arise from qualitatively different mechanisms, only some of which reflect new algorithmic structure.

Significance. If the mechanism distinctions hold, the work would meaningfully advance evaluation practices in evolutionary computation and LLM-guided search by shifting focus from final scores to process diagnostics. The dataset and replay methodology could enable more reproducible and targeted analysis of whether gains reflect genuine innovation versus re-tuning or overfitting. Concrete quantitative observations such as the cycling rate provide falsifiable anchors for follow-up studies.

major comments (2)

[§4.2] LLM-as-Judge Pipeline and Validation: The manuscript reports validation against blind human re-annotation but provides no per-type agreement rates, no quantification of agreement specifically on edits labeled as 'new algorithmic structure', and no sensitivity analysis showing how re-labeling of borderline cases affects the claim that most gains come from a small subset of types. Because this classification step is required to map score improvements to the four distinct mechanisms listed in the abstract, the absence of these metrics leaves the qualitative distinction under-supported.
[§3.1] EvoReplay Interventions: The controlled interventions are introduced to test mechanisms behind high-scoring solutions, yet the text does not detail how each intervention (constant adjustment, component removal, model substitution) isolates 'new algorithmic structure' from re-tuning, recombination, or evaluator overfitting. Without explicit mapping or controls for confounding factors, it is unclear whether the interventions confirm the claimed separation of mechanisms.

minor comments (2)

[Abstract] The abstract states that traces come from 'four evolutionary frameworks' without naming them; adding the specific names would improve immediate clarity for readers.
[Figures] Figure captions for edit-type distributions should explicitly list the nine types and their definitions to allow readers to interpret the 'small subset' result without cross-referencing the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional detail can strengthen the presentation of our validation and intervention methodology. We respond to each major comment below and indicate the revisions we will make.

Referee: [§4.2] LLM-as-Judge Pipeline and Validation: The manuscript reports validation against blind human re-annotation but provides no per-type agreement rates, no quantification of agreement specifically on edits labeled as 'new algorithmic structure', and no sensitivity analysis showing how re-labeling of borderline cases affects the claim that most gains come from a small subset of types. Because this classification step is required to map score improvements to the four distinct mechanisms listed in the abstract, the absence of these metrics leaves the qualitative distinction under-supported.

Authors: We agree that per-type agreement rates and a sensitivity analysis would provide stronger support for the classification step. In the revised manuscript we will add a table in §4.2 reporting agreement (Cohen’s kappa and raw percentage) for each of the nine edit types, with a separate row for the ‘new algorithmic structure’ category. We will also include a sensitivity analysis that re-labels borderline cases according to the human annotators’ secondary choices and shows that the result (most score gains arising from a small subset of types) remains stable.

Revision: yes

Referee: [§3.1] EvoReplay Interventions: The controlled interventions are introduced to test mechanisms behind high-scoring solutions, yet the text does not detail how each intervention (constant adjustment, component removal, model substitution) isolates 'new algorithmic structure' from re-tuning, recombination, or evaluator overfitting. Without explicit mapping or controls for confounding factors, it is unclear whether the interventions confirm the claimed separation of mechanisms.

Authors: We accept that the current text does not make the mapping between interventions and mechanisms fully explicit. In the revision we will expand §3.1 with a table that directly links each intervention to the mechanism(s) it is intended to isolate (constant adjustment for re-tuning, component removal for new structure versus recombination, model/prompt substitution for internal knowledge versus search-derived structure). We will also describe the controls already present in the replay protocol (held-out test cases and fixed evaluator seeds) to address potential evaluator overfitting.

Revision: yes
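
The per-type agreement table discussed in the first exchange (raw percentage and Cohen's kappa for each edit type) could be computed along the following lines. This is a sketch, not the authors' procedure: because edits can carry several labels, it treats each label as a separate present/absent decision and computes a binary kappa per label. The function names and input format are hypothetical.

```python
def cohen_kappa_binary(judge, human):
    """Cohen's kappa for two equal-length lists of booleans."""
    n = len(judge)
    if n == 0 or n != len(human):
        raise ValueError("inputs must be non-empty and of equal length")
    p_observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Chance agreement: both say "present" or both say "absent".
    p_expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1 - p_expected)


def per_label_agreement(judge_labels, human_labels, taxonomy):
    """Map each label to (raw agreement, kappa) over paired annotations.

    `judge_labels` and `human_labels` are parallel lists of label sets,
    one entry per edit.
    """
    table = {}
    for label in taxonomy:
        judge = [label in labels for labels in judge_labels]
        human = [label in labels for labels in human_labels]
        raw = sum(j == h for j, h in zip(judge, human)) / len(judge)
        table[label] = (raw, cohen_kappa_binary(judge, human))
    return table


judge = [{"Bug fix"}, {"Bug fix"}, {"Pruning"}, {"Pruning"}]
human = [{"Bug fix"}, {"Pruning"}, {"Pruning"}, {"Pruning"}]
for label, (raw, kappa) in per_label_agreement(judge, human, ["Bug fix", "Pruning"]).items():
    print(label, raw, kappa)
```

In the toy data the annotators agree on three of four edits for each label (raw agreement 0.75), but kappa is only 0.5 once chance agreement is removed, which is why reporting both numbers per label is more informative than raw agreement alone.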

Circularity Check

0 steps flagged

Empirical trace analysis is self-contained with no circular derivation

The paper's central results rest on direct inspection of evolutionary search traces via the introduced EvoTrace dataset and EvoReplay interventions, plus LLM-as-judge annotation of edit types validated by blind human re-annotation. These are observational findings about the distribution of score gains and byte-identical line re-introductions; no mathematical derivation, fitted parameter renamed as prediction, or load-bearing self-citation chain is present that would reduce the claimed distinctions among mechanisms to the inputs by construction. The analysis is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard domain assumptions about the meaningfulness of task-specific evaluators in evolutionary search but introduces no free parameters, new axioms beyond those, or invented entities; the contribution lies in the empirical analysis framework and dataset.

axioms (1)

Domain assumption: Task-specific evaluators provide meaningful feedback capable of distinguishing different mechanisms of improvement.
The entire diagnostic analysis depends on the assumption that the evaluators used during evolutionary search are reliable indicators of progress.


discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 20 internal anchors

[1]

Pawan Kumar, Emilien Dupont, Francisco J

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nature, 625(7995):468–475, January 2024. ISSN 0028-0836, 1476-...

work page doi:10.1038/s41586-023-06924-6 2024
[2]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Mathematical exploration and discovery at scale

Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, December 2025. https://arxiv.org/abs/2511.02864

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution, September 2025

Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution, September 2025. https://arxiv.org/abs/2509. 19349

work page 2025
[5]

CodeEvolve: An open source evolutionary coding agent for algorithmic discovery and optimization, March 2026

Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. CodeEvolve: An open source evolutionary coding agent for algorithmic discovery and optimization, March 2026. https://arxiv.org/abs/2510.14150

work page arXiv 2026
[6]

AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization, February 2026

Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization, February 2026. https: //arxiv.org/abs/2602.20133

work page arXiv 2026
[7]

Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G

Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, and Ion Stoica. EvoX: Meta- Evolution for Automated Discovery, March 2026.https://arxiv.org/abs/2602.23413

work page arXiv 2026
[8]

Let the Barbarians In: How AI Can Accelerate Systems Performance Research, December 2025.https://arxiv.org/abs/2512.14806

Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Shubham Agarwal, Mert Cemri, Bowen Wang, Alexander Krentsel, Tian Xia, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Ashwin Naren, Shulu Li, Ruiying Ma, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. Let the Barbarians In: How AI Can Accelerate Systems Performance Research, De...

work page arXiv 2025
[9]

Evo- Engineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models, October 2025.https://arxiv.org/abs/2510.03760

Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, and Qingfu Zhang. Evo- Engineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models, October 2025.https://arxiv.org/abs/2510.03760

work page arXiv 2025
[10]

Gonzalez, and Ion Stoica

Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, and Ion Stoica. K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model, February 2026. https://arxiv.org/abs/2602. 19128

work page 2026
[11]

KernelFoundry: Hardware-aware evolutionary GPU kernel optimization, March 2026

Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, and Benjamin Ummen- hofer. KernelFoundry: Hardware-aware evolutionary GPU kernel optimization, March 2026. https://arxiv.org/abs/2603.12440

work page arXiv 2026
[12]

Openevolve: an open-source evolutionary coding agent, 2025

Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https://github.com/algorithmicsuperintelligence/openevolve

work page 2025
[13]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl- Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, February 2026. htt...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

GigaEvo: An Open Source Op- timization Framework Powered By LLMs And Evolution Algorithms, November 2025

Valentin Khrulkov, Andrey Galichin, Denis Bashkirov, Dmitry Vinichenko, Oleg Travkin, Roman Alferov, Andrey Kuznetsov, and Ivan Oseledets. GigaEvo: An Open Source Op- timization Framework Powered By LLMs And Evolution Algorithms, November 2025. https://arxiv.org/abs/2511.17592

work page arXiv 2025
[15]

The FM Agent, February 2026.https://arxiv.org/abs/2510.26144

Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou, Lizhe Cao, Cheng Ju, Jianmin Wu, Huaiming Li, Haobo Zhang, Shenghao Feng, Mo Zhao, Fengzhi Qiu, Rui Yang, Mengmeng Zhang, Wenyi Zhu, Yingying Sun, Quan Sun, Shunhao Yan, Danyu Liu, Dawei Yin, and Dou Shen. The FM Agent, February 2026.https://arxiv.org/abs/2510.26144

work page arXiv 2026
[16]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code, February 2025. https: //arxiv.org/abs/2502.13138

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Chi, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng, and Beidou Wang

Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng, and Beidou Wang. PACEvolve: Enabling Long- Horizon Progress-Aware Consistent Evolution, January 2026. https://arxiv.org/abs/ 2601.10657

work page arXiv 2026
[18]

AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, and Emad Barsoum. AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection, February 2026.https://arxiv.org/abs/2602.11931

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad, March 2026

Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han, and Kun Zhang. CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad, March 2026. https: //arxiv.org/abs/2603.14575

work page arXiv 2026
[20]

\(X\)-evolve: Solution space evolution powered by large language models, August 2025.https://arxiv.org/abs/2508.07932

Yi Zhai, Zhiqiang Wei, Ruohan Li, Keyu Pan, Shuo Liu, Lu Zhang, Jianmin Ji, Wuyang Zhang, Yu Zhang, and Yanyong Zhang. \(X\)-evolve: Solution space evolution powered by large language models, August 2025.https://arxiv.org/abs/2508.07932

work page arXiv 2025
[21]

SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution

Sichun Luo, Yi Huang, Haochen Luo, Fengyuan Liu, Guanzhi Deng, Lei Li, Qinghua Yao, Zefa Hu, Junlan Feng, and Qi Liu. SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution, April 2026.https://arxiv.org/abs/2604.24372

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Meta Context Engineer- ing via Agentic Skill Evolution, February 2026.https://arxiv.org/abs/2601.21557

Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta Context Engineer- ing via Agentic Skill Evolution, February 2026.https://arxiv.org/abs/2601.21557

work page arXiv 2026
[23]

C- Evolve: Consensus-based Evolution for Prompt Groups, September 2025

Tiancheng Li, Yuhang Wang, Zhiyang Chen, Zijun Wang, Liyuan Ma, and Guo-jun Qi. C- Evolve: Consensus-based Evolution for Prompt Groups, September 2025. https://arxiv. org/abs/2509.23331

work page arXiv 2025
[24]

Population-Evolve: A Parallel Sampling and Evolutionary Method for LLM Math Reasoning, December 2025

Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, and Shuxin Zheng. Population-Evolve: A Parallel Sampling and Evolutionary Method for LLM Math Reasoning, December 2025. https://arxiv.org/abs/2512.19081

work page arXiv 2025
[25]

Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery, February 2026

Timothee Leleu, Sudeera Gunathilaka, Federico Ghimenti, and Surya Ganguli. Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery, February 2026. https:// arxiv.org/abs/2602.03132

work page arXiv 2026
[26]

Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI, March 2026

Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer. Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI, March 2026. https: //arxiv.org/abs/2507.14172

work page arXiv 2026
[27]

ThetaEvolve: Test-time Learning on Open Problems

Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. ThetaEvolve: Test-time Learning on Open Problems, November 2025.https://arxiv.org/abs/2511.23473

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Learning to Discover at Test Time

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to Discover at Test Time, February 2026.https://arxiv.org/abs/2601.16175. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Ruiying Ma, Chieh-Jan Mike Liang, Yanjie Gao, and Francis Y . Yan. MetaMuse: Algorithm Generation via Creative Ideation, October 2025.https://arxiv.org/abs/2510.03851

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

LLM Priors for ERM over Programs, February 2026.https://arxiv.org/abs/2510.14331

Shivam Singhal, Priyadarsi Mishra, Eran Malach, and Tomer Galanti. LLM Priors for ERM over Programs, February 2026.https://arxiv.org/abs/2510.14331

work page arXiv 2026
[31]

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain, 2026.https://arxiv.org/abs/2603.02218

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, and Sean Welleck. Ada- Explore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation, April 2026.https://arxiv.org/abs/2604.16625

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Daniel Nichols, Konstantinos Parasyris, Caetano Melone, Tal Ben-Nun, Giorgis Georgakoudis, and Harshitha Menon. Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search, April 2026.https://arxiv.org/abs/2604.11109

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Astra: A Multi-Agent System for GPU Kernel Performance Optimization, December 2025.https://arxiv.org/abs/2509.07506

Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A Multi-Agent System for GPU Kernel Performance Optimization, December 2025.https://arxiv.org/abs/2509.07506

work page arXiv 2025
[35]

ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization, February 2026.https://arxiv.org/abs/2602.02597

Hongyuan Su, Yu Zheng, and Yong Li. ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization, February 2026.https://arxiv.org/abs/2602.02597

[36] Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. Barbarians at the Gate: How AI is Upending Systems Research, October 2025. https://arxiv.org/abs/2510.06189

[37] Hongzheng Chen, Alexander Novikov, Ngân Vũ, Hanna Alam, Zhiru Zhang, Aiden Grossman, Mircea Trofin, and Amir Yazdanbakhsh. Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve, January 2026. https://arxiv.org/abs/2601.21096

[38] Raghav Gupta, Akanksha Jain, Abraham Gonzalez, Alexander Novikov, Po-Sen Huang, Matej Balog, Marvin Eisenberger, Sergey Shirobokov, Ngân Vũ, Martin Dixon, Borivoje Nikolić, Parthasarathy Ranganathan, and Sagar Karandikar. ArchAgent: Agentic AI-driven Computer Architecture Discovery, February 2026. https://arxiv.org/abs/2602.22425

[39] Tianyi Li, Shihui Zang, and Moritz Münchmeyer. MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models, February 2026. https://arxiv.org/abs/2602.15951

[40] Shipeng Cen and Ying Tan. Beyond Algorithm Evolution: An LLM-Driven Framework for the Co-Evolution of Swarm Intelligence Optimization Algorithms and Prompts, December 2025. https://arxiv.org/abs/2512.09209

[41] Zhuo-Yang Song, Zeyu Cai, Shutao Zhang, Jiashen Wei, Jichen Pan, Shi Qiu, Qing-Hong Cao, Tie-Jiun Hou, Xiaohui Liu, Ming-xing Luo, and Hua Xing Zhu. Iterated Agent for Symbolic Regression, October 2025. https://arxiv.org/abs/2510.08317

[42] Jinming Nian, Fangchen Li, Dae Hoon Park, and Yi Fang. RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution, February 2026. https://arxiv.org/abs/2602.16932

[43] Haochen Wang, Yi Wu, Daryl Chang, Li Wei, and Lukasz Heldt. Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents, February 2026. https://arxiv.org/abs/2602.10226

[44] Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, and Arber Zela. Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch, April 2026. https://arxiv.org/abs/2603.24647

[45] Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, and Yi Xu. Controlled Self-Evolution for Algorithmic Code Optimization, February 2026. https://arxiv.org/abs/2601.07348

[46] Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, and Bo Han. AlphaApollo: A System for Deep Agentic Reasoning, March 2026. https://arxiv.org/abs/2510.06261

[47] Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian. R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, October 2025. https://arxiv.org/abs/2505.14738

[48] Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, February 2026. https://arxiv.org/abs/2602.04837

[49] Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ramchandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. SkyDiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026. URL...

[50] John R. Koza. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2), June 1994. ISSN 0960-3174, 1573-1375. doi: 10.1007/BF00175355

[51] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population Based Training of Neural Networks, November 2017. https://arxiv.org/abs/1711.09846

[52] Chao Qian, Ke Xue, and Ren-Jian Wang. Quality-Diversity Algorithms Can Provably Be Helpful for Optimization, May 2024. https://arxiv.org/abs/2401.10539

[53] Giorgia Nadizar, Francesco Rusin, Eric Medvet, and Gabriela Ochoa. The Role of Stepping Stones in MAP-Elites: Insights from Search Trajectory Networks. In Bing Xue, Luca Manzoni, and Illya Bakurov, editors, Genetic Programming, volume 15609, pages 224–239. Springer Nature Switzerland, Cham, 2025. ISBN 978-3-031-89990-4 978-3-031-89991-1. doi: 10.1007/978-...

[54] Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktäschel. Open-Endedness is Essential for Artificial Superhuman Intelligence, June 2024. https://arxiv.org/abs/2406.04268

[55] Dan Friedman and Adji Bousso Dieng. The Vendi Score: A Diversity Evaluation Metric for Machine Learning, July 2023. https://arxiv.org/abs/2210.02410

[56] Rui Zhang and Zhichao Lu. Rethinking Code Similarity for Automated Algorithm Design with LLMs, March 2026. https://arxiv.org/abs/2603.02787

[57] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems, August 2025. https://arxiv.org/abs/2508.07407

[58] Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, and Yue Zhang. How Far Are AI Scientists from Changing the World?, August 2025. https://arxiv.org/abs/2507.23276

[59] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, February 2025. https://arxiv.org/abs/2410.07095

[60] Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering, October 2025. https://arxiv.org/abs/2506.09050

[61] Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Ch... AIRS-Bench: A Suite of Tasks for Frontier AI Research Science Agents, February 2026. https://arxiv.org/abs/2602.06855

[62] Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, and Yuzhi Xu. Evaluation-driven Scaling for Scientific Discovery,...

[63] Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A. Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, and...

[64] Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, and Ningyu Zhang. Can We Predict Before Executing Machine Learning Agents?, January 2026. https://arxiv.org/abs/2601.05930

[65] Yonatan Gideoni, Sebastian Risi, and Yarin Gal. Simple Baselines are Competitive with Code Evolution, February 2026. https://arxiv.org/abs/2602.16805

[66] Xinhao Zhang, Xi Chen, François Portet, and Maxime Peyrard. What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search, April 2026. https://arxiv.org/abs/2604.19440

[67] Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search, August 2025. https://arxiv.org/abs/2504.19636

[68] Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, and Ching-An Cheng. Understanding the Challenges in Iterative Generative Optimization with LLMs, March 2026. https://arxiv.org/abs/2603.23994

[69] Lan Pan, Hanbo Xie, and Robert C. Wilson. Large Language Models Think Too Fast To Explore Effectively, May 2025. https://arxiv.org/abs/2501.18009

[70] Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond), October 2025. https://arxiv.org/abs/2510.22954

[71] Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, and Jing Shao. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents, March 2026. https://arxiv.org/abs/2509.26354

[72] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why Do Multi-Agent LLM Systems Fail?, October 2025. https://arxiv.org/abs/2503.13657


[30] [30]

LLM Priors for ERM over Programs, February 2026.https://arxiv.org/abs/2510.14331

Shivam Singhal, Priyadarsi Mishra, Eran Malach, and Tomer Galanti. LLM Priors for ERM over Programs, February 2026.https://arxiv.org/abs/2510.14331

work page arXiv 2026

[31] [31]

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain, 2026.https://arxiv.org/abs/2603.02218

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, and Sean Welleck. Ada- Explore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation, April 2026.https://arxiv.org/abs/2604.16625

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Daniel Nichols, Konstantinos Parasyris, Caetano Melone, Tal Ben-Nun, Giorgis Georgakoudis, and Harshitha Menon. Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search, April 2026.https://arxiv.org/abs/2604.11109

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Astra: A Multi-Agent System for GPU Kernel Performance Optimization, December 2025.https://arxiv.org/abs/2509.07506

Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A Multi-Agent System for GPU Kernel Performance Optimization, December 2025.https://arxiv.org/abs/2509.07506

work page arXiv 2025

[35] [35]

ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization, February 2026.https://arxiv.org/abs/2602.02597

Hongyuan Su, Yu Zheng, and Yong Li. ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization, February 2026.https://arxiv.org/abs/2602.02597

work page arXiv 2026

[36] [36]

Barbarians at the Gate: How AI is Upending Systems Research, October 2025.https://arxiv.org/abs/2510.06189

Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. Barbarians at the Gate: How AI is Upending Systems Research, October 2025.https://arxiv.org/abs/2510.06189

work page arXiv 2025

[37] [37]

Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve, January 2026

Hongzheng Chen, Alexander Novikov, Ngân V˜u, Hanna Alam, Zhiru Zhang, Aiden Grossman, Mircea Trofin, and Amir Yazdanbakhsh. Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve, January 2026. https://arxiv.org/abs/2601. 21096

work page 2026

[38] [38]

ArchAgent: Agentic AI-driven Computer Architecture Discovery, February 2026.https://arxiv.org/abs/2602.22425

Raghav Gupta, Akanksha Jain, Abraham Gonzalez, Alexander Novikov, Po-Sen Huang, Matej Balog, Marvin Eisenberger, Sergey Shirobokov, Ngân V ˜u, Martin Dixon, Borivoje Nikoli ´c, Parthasarathy Ranganathan, and Sagar Karandikar. ArchAgent: Agentic AI-driven Computer Architecture Discovery, February 2026.https://arxiv.org/abs/2602.22425

work page arXiv 2026

[39] [39]

MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models, February 2026

Tianyi Li, Shihui Zang, and Moritz Münchmeyer. MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models, February 2026. https://arxiv. org/abs/2602.15951

work page arXiv 2026

[40] [40]

Beyond Algorithm Evolution: An LLM-Driven Framework for the Co-Evolution of Swarm Intelligence Optimization Algorithms and Prompts, December 2025

Shipeng Cen and Ying Tan. Beyond Algorithm Evolution: An LLM-Driven Framework for the Co-Evolution of Swarm Intelligence Optimization Algorithms and Prompts, December 2025. https://arxiv.org/abs/2512.09209

work page arXiv 2025

[41] [41]

Iterated Agent for Symbolic Regression, October 2025.https://arxiv.org/abs/2510.08317

Zhuo-Yang Song, Zeyu Cai, Shutao Zhang, Jiashen Wei, Jichen Pan, Shi Qiu, Qing-Hong Cao, Tie-Jiun Hou, Xiaohui Liu, Ming-xing Luo, and Hua Xing Zhu. Iterated Agent for Symbolic Regression, October 2025.https://arxiv.org/abs/2510.08317

work page arXiv 2025

[42] [42]

RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution, February 2026

Jinming Nian, Fangchen Li, Dae Hoon Park, and Yi Fang. RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution, February 2026. https:// arxiv.org/abs/2602.16932

work page arXiv 2026

[43] [43]

Self-Evolving Recommenda- tion System: End-To-End Autonomous Model Optimization With LLM Agents, February 2026

Haochen Wang, Yi Wu, Daryl Chang, Li Wei, and Lukasz Heldt. Self-Evolving Recommenda- tion System: End-To-End Autonomous Model Optimization With LLM Agents, February 2026. https://arxiv.org/abs/2602.10226

work page arXiv 2026

[44] [44]

Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, and Arber Zela. Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch, April 2026. https://arxiv.org/abs/2603.24647. 13

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

Controlled Self-Evolution for Algorithmic Code Optimization, February 2026

Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, and Yi Xu. Controlled Self-Evolution for Algorithmic Code Optimization, February 2026. https://arxiv.org/abs/2601.07348

work page arXiv 2026

[46] [46]

AlphaApollo: A System for Deep Agentic Reasoning, March 2026

Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, and Bo Han. AlphaApollo: A System for Deep Agentic Reasoning, March 2026. https://arxiv.org/abs/2510. 06261

work page 2026

[47] [47]

R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, October 2025.https://arxiv.org/abs/2505.14738

Xu Yang, Xiao Yang, Shikai Fang, Yifei Zhang, Jian Wang, Bowen Xian, Qizheng Li, Jingyuan Li, Minrui Xu, Yuante Li, Haoran Pan, Yuge Zhang, Weiqing Liu, Yelong Shen, Weizhu Chen, and Jiang Bian. R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science, October 2025.https://arxiv.org/abs/2505.14738

work page arXiv 2025

[48] [48]

Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, February 2026.https://arxiv.org/abs/2602.04837

Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, February 2026.https://arxiv.org/abs/2602.04837

work page arXiv 2026

[49] Shu Liu, Mert Cemri, Shubham Agarwal, Alexander Krentsel, Ashwin Naren, Qiuyang Mang, Zhifei Li, Akshat Gupta, Monishwaran Maheswaran, Audrey Cheng, Melissa Pan, Ethan Boneh, Kannan Ramchandran, Koushik Sen, Alexandros G. Dimakis, Matei Zaharia, and Ion Stoica. SkyDiscover: A flexible framework for AI-driven scientific and algorithmic discovery, 2026. URL...

[50] John R. Koza. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2), June 1994. ISSN 0960-3174, 1573-1375. doi: 10.1007/BF00175355

[51] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population Based Training of Neural Networks, November 2017. https://arxiv.org/abs/1711.09846

[52] Chao Qian, Ke Xue, and Ren-Jian Wang. Quality-Diversity Algorithms Can Provably Be Helpful for Optimization, May 2024. https://arxiv.org/abs/2401.10539

[53] Giorgia Nadizar, Francesco Rusin, Eric Medvet, and Gabriela Ochoa. The Role of Stepping Stones in MAP-Elites: Insights from Search Trajectory Networks. In Bing Xue, Luca Manzoni, and Illya Bakurov, editors, Genetic Programming, volume 15609, pages 224–239. Springer Nature Switzerland, Cham, 2025. ISBN 978-3-031-89990-4 978-3-031-89991-1. doi: 10.1007/ 978-...

[54] Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktäschel. Open-Endedness is Essential for Artificial Superhuman Intelligence, June 2024. https://arxiv.org/abs/2406.04268

[55] Dan Friedman and Adji Bousso Dieng. The Vendi Score: A Diversity Evaluation Metric for Machine Learning, July 2023. https://arxiv.org/abs/2210.02410

[56] Rui Zhang and Zhichao Lu. Rethinking Code Similarity for Automated Algorithm Design with LLMs, March 2026. https://arxiv.org/abs/2603.02787

[57] Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, and Zaiqiao Meng. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems, August 2025. https://arxiv.org/abs/2508.07407

[58] Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, and Yue Zhang. How Far Are AI Scientists from Changing the World?, August 2025. https://arxiv.org/abs/2507.23276

[59] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, February 2025. https://arxiv.org/abs/2410.07095

[60] Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering, October 2025. https://arxiv.org/abs/2506.09050

[61] Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Ch...

[62] Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, and Yuzhi Xu. Evaluation-driven Scaling for Scientific Discovery,...

[63] Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A. Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, and...

[64] Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, and Ningyu Zhang. Can We Predict Before Executing Machine Learning Agents?, January 2026. https://arxiv.org/abs/2601.05930

[65] Yonatan Gideoni, Sebastian Risi, and Yarin Gal. Simple Baselines are Competitive with Code Evolution, February 2026. https://arxiv.org/abs/2602.16805

[66] Xinhao Zhang, Xi Chen, François Portet, and Maxime Peyrard. What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search, April 2026. https://arxiv.org/abs/2604.19440

[67] Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search, August 2025. https://arxiv.org/abs/2504.19636

[68] Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, and Ching-An Cheng. Understanding the Challenges in Iterative Generative Optimization with LLMs, March 2026. https://arxiv.org/abs/2603.23994

[69] Lan Pan, Hanbo Xie, and Robert C. Wilson. Large Language Models Think Too Fast To Explore Effectively, May 2025. https://arxiv.org/abs/2501.18009

[70] Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond), October 2025. https://arxiv.org/abs/2510.22954

[71] Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, and Jing Shao. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents, March 2026. https://arxiv.org/abs/2509.26354

[72] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why Do Multi-Agent LLM Systems Fail?, October 2025. https://arxiv.org/abs/2503.13657

A Additional EvoTrace Details

A.1 Per-field trace schema

Evo...