What Do Evolutionary Coding Agents Evolve?
Pith reviewed 2026-05-20 03:40 UTC · model grok-4.3
pith:47Q5SZXC
The pith
Evolutionary coding agents often improve scores by cycling deleted lines back into code rather than inventing new algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Benchmark gains in evolutionary coding agents arise from qualitatively different mechanisms, only some of which introduce new algorithmic structure; a deterministic cycling pattern appears in which roughly thirty percent of lines added during search are byte-identical re-introductions of previously deleted lines.
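One way to make the cycling measurement concrete is the sketch below. It is not the paper's implementation: it assumes a trace is available as an ordered list of program versions along a single lineage, uses Python's `difflib` to recover added and deleted lines, and ignores blank lines. The function names `diff_lines` and `reintroduction_rate` are hypothetical.

```python
import difflib


def diff_lines(old: str, new: str):
    """Return (added, deleted) line lists between two program versions."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    added, deleted = [], []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return added, deleted


def reintroduction_rate(versions):
    """Fraction of added lines that exactly match a line deleted at an earlier step.

    Assumption: `versions` is one lineage in chronological order, and
    blank lines are skipped so that whitespace churn does not count.
    """
    previously_deleted = set()
    total_added = 0
    reintroduced = 0
    for old, new in zip(versions, versions[1:]):
        added, deleted = diff_lines(old, new)
        for line in added:
            if not line.strip():
                continue
            total_added += 1
            # Exact string match, including indentation and trailing spaces.
            if line in previously_deleted:
                reintroduced += 1
        # Update after counting so a line moved within one step is not counted.
        previously_deleted.update(deleted)
    return reintroduced / total_added if total_added else 0.0


if __name__ == "__main__":
    trace = ["a = 1\nb = 2\n", "a = 1\nb = 3\n", "a = 1\nb = 2\n"]
    print(reintroduction_rate(trace))
```

In the toy trace, `b = 2` is deleted in the first step and restored in the second, so one of the two added lines counts as a re-introduction.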
What carries the argument
Annotation of every code edit into one of nine recurring types using a validated LLM-as-judge pipeline applied to full evolutionary traces.
If this is right
- Reported progress on coding benchmarks can reflect simple re-tunes or cycling instead of structural novelty.
- Diagnostic evaluation must inspect edit distributions and search dynamics rather than final scores alone.
- Controlled interventions that block line re-introductions can test whether performance depends on cycling.
- The EvoTrace dataset supports more precise comparison of evolutionary coding methods.
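The intervention in the third point, blocking line re-introductions, could be approximated by a filter applied to candidate programs before evaluation. The sketch below is an illustration under stated assumptions, not a mechanism described in the paper: `deleted_history` is assumed to be a set of lines deleted earlier in the run, "added" is approximated as "present in the child but not in the parent", and the names `reintroduces_deleted_line` and `filter_candidates` are hypothetical.

```python
def reintroduces_deleted_line(parent: str, child: str, deleted_history: set) -> bool:
    """True if the child adds a non-blank line that was deleted earlier in the run."""
    parent_lines = set(parent.splitlines())
    for line in child.splitlines():
        if not line.strip():
            continue  # assumption: blank lines never count as re-introductions
        if line not in parent_lines and line in deleted_history:
            return True
    return False


def filter_candidates(parent: str, candidates: list, deleted_history: set) -> list:
    """Keep only candidates that do not bring back a previously deleted line."""
    return [
        child
        for child in candidates
        if not reintroduces_deleted_line(parent, child, deleted_history)
    ]


if __name__ == "__main__":
    parent = "x = 1\ny = 2"
    history = {"z = 3"}
    candidates = ["x = 1\nz = 3", "x = 1\nw = 4"]
    print(filter_candidates(parent, candidates, history))
```

Comparing final scores with and without such a filter would indicate whether performance depends on cycling.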
Where Pith is reading between the lines
- The cycling behavior may indicate that search stays within a narrow region of the model's prior knowledge.
- Similar re-introduction loops could limit progress in other iterative LLM editing workflows.
Load-bearing premise
The nine edit types assigned by the LLM judge accurately reflect the mechanisms that produce score changes.
What would settle it
Human re-annotation of the same edits that assigns score gains to different edit types and finds no thirty-percent cycling rate.
Original abstract
Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model's internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process itself, not only its final outcome. We introduce EvoTrace, a dataset of evolutionary coding traces spanning four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks across mathematics and algorithm design. To analyze these traces, we develop EvoReplay, a replay-based methodology that reconstructs the local search states behind high-scoring solutions and tests controlled interventions, including adjusting constants, removing program components and substituting models or prompting contexts. We annotate every code edit in EvoTrace with one of nine recurring edit types using an LLM-as-judge pipeline validated against blind human re-annotation. Across EvoTrace, most score gains come from a small subset of these edit types. We further find a deterministic cycling pattern: about 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. These results show that benchmark gains in evolutionary coding agents can arise from qualitatively different mechanisms, only some of which correspond to new algorithmic structure. EvoTrace enables more diagnostic evaluation of evolutionary coding agents beyond final benchmark scores.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvoTrace, a dataset of evolutionary coding traces spanning four frameworks, reasoning and non-reasoning models, and 16 tasks in mathematics and algorithm design. It develops EvoReplay, a replay-based method to reconstruct local search states and perform controlled interventions (e.g., adjusting constants, removing components, substituting models or prompts). All code edits are annotated into one of nine recurring types via an LLM-as-judge pipeline validated by blind human re-annotation. Findings include that most score gains derive from a small subset of edit types and a deterministic cycling pattern in which ~30% of added lines are byte-identical re-introductions of previously deleted lines. The central claim is that benchmark gains arise from qualitatively different mechanisms, only some of which reflect new algorithmic structure.
Significance. If the mechanism distinctions hold, the work would meaningfully advance evaluation practices in evolutionary computation and LLM-guided search by shifting focus from final scores to process diagnostics. The dataset and replay methodology could enable more reproducible and targeted analysis of whether gains reflect genuine innovation versus re-tuning or overfitting. Concrete quantitative observations such as the cycling rate provide falsifiable anchors for follow-up studies.
major comments (2)
- [§4.2] LLM-as-Judge Pipeline and Validation: The manuscript reports validation against blind human re-annotation but provides no per-type agreement rates, no quantification of agreement specifically on edits labeled as 'new algorithmic structure', and no sensitivity analysis showing how re-labeling of borderline cases affects the claim that most gains come from a small subset of types. Because this classification step is required to map score improvements to the four distinct mechanisms listed in the abstract, the absence of these metrics leaves the qualitative distinction under-supported.
- [§3.1] EvoReplay Interventions: The controlled interventions are introduced to test mechanisms behind high-scoring solutions, yet the text does not detail how each intervention (constant adjustment, component removal, model substitution) isolates 'new algorithmic structure' from re-tuning, recombination, or evaluator overfitting. Without explicit mapping or controls for confounding factors, it is unclear whether the interventions confirm the claimed separation of mechanisms.
minor comments (2)
- [Abstract] The abstract states that traces come from 'four evolutionary frameworks' without naming them; adding the specific names would improve immediate clarity for readers.
- [Figures] Figure captions for edit-type distributions should explicitly list the nine types and their definitions to allow readers to interpret the 'small subset' result without cross-referencing the main text.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight areas where additional detail can strengthen the presentation of our validation and intervention methodology. We respond to each major comment below and indicate the revisions we will make.
- Referee: [§4.2] LLM-as-Judge Pipeline and Validation: The manuscript reports validation against blind human re-annotation but provides no per-type agreement rates, no quantification of agreement specifically on edits labeled as 'new algorithmic structure', and no sensitivity analysis showing how re-labeling of borderline cases affects the claim that most gains come from a small subset of types. Because this classification step is required to map score improvements to the four distinct mechanisms listed in the abstract, the absence of these metrics leaves the qualitative distinction under-supported.
  Authors: We agree that per-type agreement rates and a sensitivity analysis would provide stronger support for the classification step. In the revised manuscript we will add a table in §4.2 reporting agreement (Cohen’s kappa and raw percentage) for each of the nine edit types, with a separate row for the ‘new algorithmic structure’ category. We will also include a sensitivity analysis that re-labels borderline cases according to the human annotators’ secondary choices and shows that the main result (most score gains arising from a small subset of types) remains stable.
  Revision: yes
- Referee: [§3.1] EvoReplay Interventions: The controlled interventions are introduced to test mechanisms behind high-scoring solutions, yet the text does not detail how each intervention (constant adjustment, component removal, model substitution) isolates 'new algorithmic structure' from re-tuning, recombination, or evaluator overfitting. Without explicit mapping or controls for confounding factors, it is unclear whether the interventions confirm the claimed separation of mechanisms.
  Authors: We accept that the current text does not make the mapping between interventions and mechanisms fully explicit. In the revision we will expand §3.1 with a table that directly links each intervention to the mechanism(s) it is intended to isolate (constant adjustment for re-tuning, component removal for new structure versus recombination, model/prompt substitution for internal knowledge versus search-derived structure). We will also describe the controls already present in the replay protocol (held-out test cases and fixed evaluator seeds) to address potential evaluator overfitting.
  Revision: yes
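The per-type agreement figures discussed in the §4.2 response (Cohen's kappa and raw percentage for each edit type) could be computed along the following lines. This is a minimal sketch, not the authors' pipeline: it treats each edit type one-vs-rest, assumes the judge and human labels are aligned lists, and uses placeholder label names because the nine edit types are not listed here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two aligned label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def per_type_agreement(judge, human):
    """Map each edit type to (kappa, raw agreement), computed one-vs-rest."""
    result = {}
    for edit_type in sorted(set(judge) | set(human)):
        a = [label == edit_type for label in judge]
        b = [label == edit_type for label in human]
        raw = sum(x == y for x, y in zip(a, b)) / len(a)
        result[edit_type] = (cohens_kappa(a, b), raw)
    return result


if __name__ == "__main__":
    # Placeholder labels; the real edit-type names are not given in this review.
    judge = ["retune", "retune", "new_structure", "new_structure"]
    human = ["retune", "retune", "new_structure", "retune"]
    for edit_type, (kappa, raw) in per_type_agreement(judge, human).items():
        print(edit_type, round(kappa, 3), round(raw, 3))
```

Reporting kappa alongside raw agreement matters for rare categories, where high raw agreement can arise largely by chance.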
Circularity Check
Empirical trace analysis is self-contained with no circular derivation
The paper's central results rest on direct inspection of evolutionary search traces via the introduced EvoTrace dataset and EvoReplay interventions, plus LLM-as-judge annotation of edit types validated by blind human re-annotation. These are observational findings about the distribution of score gains and byte-identical line re-introductions; no mathematical derivation, fitted parameter renamed as prediction, or load-bearing self-citation chain is present that would reduce the claimed distinctions among mechanisms to the inputs by construction. The analysis is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Task-specific evaluators provide meaningful feedback capable of distinguishing different mechanisms of improvement.