AgentGA: Evolving Code Solutions in Agent-Seed Space
Pith reviewed 2026-05-12 04:08 UTC · model grok-4.3
The pith
Optimizing the agent seed—starting prompt plus inherited parent archives—lets a genetic algorithm evolve superior autonomous code-generation runs for tabular AutoML.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentGA couples a genetic algorithm with long-horizon agents by evolving the agent seed (the task prompt together with optional parent archives that initialize each fresh workspace) rather than mutating code directly. Selection proceeds via deterministic 1:1 elite tournaments, operator allocation adapts online via a modified Hedge controller, and each generation launches an isolated autonomous run. On the full 16-competition benchmark the method records a 71.90 percent average exceed-human rate against 51.38 percent for the AIDE reference and wins 15 of 16 tasks; within runs, archive-inheriting descendants prevail in 51.9 percent of 1,680 parent-child tournaments, while de-novo proposals win only 8.6 percent.
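The operator-allocation step can be pictured with a standard Hedge-style exponential-weight controller in the spirit of Freund and Schapire (1997). The sketch below is a generic form under assumed reward semantics; the paper's "modified" variant is unspecified, and the function names and the learning rate `eta` are illustrative.

```python
import math
import random

def hedge_pick(weights, rng=random):
    """Sample an operator index with probability proportional to its weight."""
    total = sum(weights)
    r = rng.random()
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w / total
        if r < acc:
            return i
    return len(weights) - 1  # guard against floating-point shortfall

def hedge_update(weights, picked, reward, eta=0.5):
    """Exponential-weight update: operators whose offspring win tournaments
    gain probability mass in future generations (mutates `weights` in place)."""
    weights[picked] *= math.exp(eta * reward)
```

In a GA outer loop, each generation would call `hedge_pick` to choose a variation operator (e.g., mutate-prompt vs. inherit-archive) and `hedge_update` with the tournament outcome as the reward.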
What carries the argument
The agent seed (the task prompt plus optional parent archives that initialize a fresh workspace) serves as the evolvable unit, allowing artifacts to be inherited across generations without direct code editing.
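The evolvable unit can be made concrete with a minimal sketch. The class and function names, field names, and the particular crossover rule below are all illustrative assumptions; the paper's actual seed schema and operators are not given in this review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSeed:
    """Evolvable unit: a task prompt plus identifiers of inherited parent
    archives that a fresh workspace is initialized with (hypothetical schema)."""
    task_prompt: str
    parent_archives: tuple = ()  # artifact ids the descendant may inspect and reuse

def crossover(a: AgentSeed, b: AgentSeed) -> AgentSeed:
    """One plausible operator: keep one parent's prompt, union both archive sets.
    This is a sketch of seed-level (not code-level) variation."""
    merged = tuple(sorted(set(a.parent_archives) | set(b.parent_archives)))
    return AgentSeed(task_prompt=a.task_prompt, parent_archives=merged)
```

The key property the sketch captures is that variation acts on starting conditions: the child run begins from a new workspace seeded with inherited artifacts, never from edited code.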
If this is right
- Descendants that inherit parent archives outperform fresh proposals in the majority of direct comparisons.
- A population-level genetic algorithm with elite tournaments and online operator allocation can be applied to long-horizon autonomous agents.
- Agent-seed optimization constitutes a practical design choice for autonomous code-search systems on tabular AutoML tasks.
- The approach wins 15 of 16 competitions on the Weco-Kaggle Lite benchmark at an average 71.90 percent exceed-human rate.
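The selection rule named above is simple to state in code. A minimal sketch of deterministic 1:1 elite tournament selection follows; higher-is-better scoring and the tie-goes-to-parent convention are assumptions, as are all names.

```python
def elite_tournament(parent, child, score):
    """Deterministic 1:1 tournament: the higher-scoring run survives.

    `score` maps a completed run to its benchmark metric (higher is better).
    Ties favoring the parent is an assumed convention, not stated in the review.
    """
    return child if score(child) > score(parent) else parent

def next_generation(population, propose_child, score):
    """One generation: each elite faces exactly one challenger derived from it,
    and the winner fills the slot (sketch; the real pairing scheme is assumed)."""
    return [elite_tournament(p, propose_child(p), score) for p in population]
```

Because every comparison is a fixed parent-versus-child pair, the 51.9 percent descendant win rate reported above is directly interpretable as the fraction of such tournaments the inheriting child won.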
Where Pith is reading between the lines
- The same seed-evolution pattern could be tested on domains other than tabular data, such as image or text generation pipelines, to check whether workspace inheritance transfers.
- If archive reuse proves robust, future systems might reduce population size by focusing compute on promising inherited lineages rather than broad random restarts.
- Separating prompt evolution from workspace-state evolution would clarify which component drives most of the observed tournament advantage.
- The method suggests that agent frameworks could benefit from explicit mechanisms to log and replay successful initialization states across independent runs.
Load-bearing premise
The reported performance gains arise specifically from optimizing and inheriting agent seeds rather than from unstated details of the underlying agent implementation, benchmark tuning, or selective reporting of runs.
What would settle it
A controlled experiment that keeps the same agent implementation and total compute budget but disables seed optimization and inheritance, then measures whether the exceed-human rate falls back to the 51.38 percent reference level.
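The proposed control can be sketched as a harness that toggles only the seed-optimization flag. Here `run_agent` and its parameters are hypothetical placeholders for the paper's (undescribed) execution setup.

```python
def ablation(run_agent, tasks, budget, seed_optimized):
    """Same agent, same total compute; only seed optimization is toggled.

    `run_agent(task, budget=..., evolve_seed=...)` is a hypothetical callable
    that executes one autonomous run and returns its exceed-human score.
    """
    scores = [run_agent(t, budget=budget, evolve_seed=seed_optimized)
              for t in tasks]
    return sum(scores) / len(scores)
```

If the average with `seed_optimized=False` falls back toward the 51.38 percent reference level while the `True` arm stays near 71.90 percent, the gain would be attributable to seed evolution rather than to the base agent.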
Original abstract
We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run in an isolated workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. Across the full benchmark, AgentGA averages 71.90% Exceeds % of Human versus 51.38% for the AIDE reference, winning 15/16 competitions. Within AgentGA runs, descendants conditioned on inherited parent archives win 51.9% of 1,680 parent-child tournaments versus 8.6% for de novo proposals. These results support agent-seed optimization as a practical design choice for autonomous code-search systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AgentGA, a framework that evolves autonomous code-generation agents by optimizing the agent seed (task prompt plus optional parent archives) via an outer genetic algorithm loop rather than direct code edits. Each generation runs a fresh long-horizon agent in an isolated workspace, with selection via deterministic 1:1 elite tournaments and online Hedge-based operator adaptation. On the 16-competition Weco-Kaggle Lite tabular AutoML benchmark, AgentGA achieves 71.90% average Exceeds % of Human versus 51.38% for the AIDE reference (winning 15/16 competitions); within AgentGA, inherited-archive descendants win 51.9% of 1,680 parent-child tournaments versus 8.6% for de novo proposals.
Significance. If the base-agent equivalence and experimental controls hold, the work provides evidence that optimizing reusable starting conditions can improve performance in long-horizon code-search systems, offering a practical alternative or complement to direct code mutation. The internal tournament design supplies a within-method control that partially isolates the inheritance effect, and the deterministic selection plus adaptive operators are clear methodological strengths.
major comments (3)
- [Abstract] The central claim that performance gains arise from agent-seed optimization requires that the underlying long-horizon agent, workspace isolation, and benchmark execution are identical between AgentGA and AIDE except for the outer GA loop and inheritance mechanism. No such confirmation or base-agent description is supplied, so the 71.90% vs 51.38% comparison cannot be attributed specifically to seed evolution.
- [Abstract] The headline results (71.90% Exceeds % of Human, 15/16 wins) are stated without any experimental protocol, run-selection criteria, statistical tests, or description of how runs were chosen or excluded. This absence makes the benchmark outcomes impossible to assess as support for the central claim.
- [Abstract] The internal 51.9% vs 8.6% tournament statistic is computed only within AgentGA runs and therefore does not address potential confounds in the cross-method comparison to AIDE.
minor comments (1)
- [Abstract] The abstract would benefit from a brief sentence clarifying the precise definition of 'Exceeds % of Human' and the construction of the Weco-Kaggle Lite benchmark.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need for clearer experimental controls and protocol details. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claim that performance gains arise from agent-seed optimization requires that the underlying long-horizon agent, workspace isolation, and benchmark execution are identical between AgentGA and AIDE except for the outer GA loop and inheritance mechanism. No such confirmation or base-agent description is supplied, so the 71.90% vs 51.38% comparison cannot be attributed specifically to seed evolution.
Authors: We agree that explicit confirmation of base-agent equivalence is required to support attribution of gains to seed optimization. The manuscript positions AgentGA as an outer loop around the same long-horizon autonomous code-generation agent and isolated workspace used by the AIDE reference, with the sole additions being the genetic algorithm over seeds and parent-archive inheritance. To make this equivalence unambiguous, we will add a dedicated subsection in the Methods section that describes the base agent configuration, workspace isolation protocol, and benchmark execution harness, together with an explicit statement that these elements are held fixed between the two methods. revision: yes
-
Referee: [Abstract] The headline results (71.90% Exceeds % of Human, 15/16 wins) are stated without any experimental protocol, run-selection criteria, statistical tests, or description of how runs were chosen or excluded. This absence makes the benchmark outcomes impossible to assess as support for the central claim.
Authors: We acknowledge that the current manuscript does not supply a full experimental protocol, run-selection criteria, or statistical analysis in the abstract or main text. The reported figures reflect single deterministic runs per competition for each method. We will insert a new Experimental Setup section that details the run protocol, any exclusion rules, computational budget, and the rationale for omitting formal statistical tests (driven by the high cost of long-horizon agent executions). This addition will allow readers to evaluate the strength of the benchmark comparison directly. revision: yes
-
Referee: [Abstract] The internal 51.9% vs 8.6% tournament statistic is computed only within AgentGA runs and therefore does not address potential confounds in the cross-method comparison to AIDE.
Authors: The referee correctly notes that the parent-child tournament results are computed exclusively inside AgentGA runs and therefore cannot serve as a control for confounds between AgentGA and AIDE. This internal statistic is intended only to isolate the contribution of archive inheritance within our own method. The primary cross-method evidence rests on the assumption of identical base agents and benchmarks, which we will make explicit in the revised Methods section as described in our response to the first comment. We will also revise the text to state the limited scope of the tournament analysis. revision: partial
Circularity Check
No circularity: purely empirical benchmark comparisons with no derivations or self-referential chains
Full rationale
The paper contains no equations, derivations, fitted parameters, or mathematical claims. All results are direct empirical measurements (win rates, tournament outcomes) on the Weco-Kaggle Lite benchmark against an external reference (AIDE). No self-citations, ansatzes, uniqueness theorems, or renamings of known results appear in the provided text. The central claim is an observed performance difference, not a reduction of any output to its inputs by construction. This is the normal case for an empirical systems paper.
Reference graph
Works this paper leans on
- [1] Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, and Chenglin Wu. SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning, October 2024. URL http://arxiv.org/abs/2410.17238. arXiv:2410.17238 [cs]
- [2] Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, March 2020. URL http://arxiv.org/abs/2003.06505. arXiv:2003.06505 [stat]
- [3] Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. MLZero: A Multi-Agent System for End-to-end Machine Learning Automation. In Advances in Neural Information Processing Systems 38 (NeurIPS), 2025. doi:10.48550/arX...
- [4] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://papers.nips.cc/paper/2015/hash/11d0e6287202fced83f79975ec59a3a6-Abstract.html
- [5] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. Journal of Machine Learning Research, 23(261):1--61, 2022. ISSN 1533-7928. URL http://jmlr.org/papers/v23/21-0992.html
- [6] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119--139, August 1997. ISSN 0022-0000. doi:10.1006/jcss.1997.1504. URL https://www.sciencedirect.com/science/article/pii/S002200009791504X
- [7] Antoine Grosnit, Alexandre Maraval, Refinath S. N, Zichao Zhao, James Doran, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, Khyati Khandelwal, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balázs Kégl, Haitham Bou-Ammar, and Jun Wang. Kolb-Based Experiential Learning for Generalist Agents wi...
- [8] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024a. doi:10.48550/arXiv.2309.08532. URL http://arxiv.org/abs/2...
- [9] Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 16813--16848, 2024b. doi:10.48550/arXiv.2402.17453. URL http://arxiv.org/abs/2402.17453
- [10] Noah Hollmann, Samuel Müller, and Frank Hutter. Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. doi:10.48550/arXiv.2305.03403. URL http://arxiv.org/abs/2305.03403
- [11] Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Chenxing Wei, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Xiangru Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zhibin Gou, Zongze Xu, and Chenglin Wu. Data Interprete...
- [12] Shengran Hu, Cong Lu, and Jeff Clune. Automated Design of Agentic Systems, 2025. URL https://arxiv.org/abs/2408.08435
- [13] Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code, February 2025. URL http://arxiv.org/abs/2502.13138. arXiv:2502.13138 [cs]
- [14] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...
- [15] Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, and Ge Zhang. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025. ...
- [16] Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, and Xinhui Wu. I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search, February 2025. URL http://arxiv.org/abs/2502.14693. arXiv:2502.14693 [cs]
- [17] Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, and Huacan Wang. SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents, 2025. URL https://arxiv.org/abs/2508.02085
- [18] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...
- [19] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO '16, pages 485--492, New York, NY, USA, July 2016. Association for Computing Machinery. ISBN 978-1-4503-4206-3. doi:...
- [20] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468--475, January 2024. ISSN 1476-468...
- [21] Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. AgentSquare: Automatic LLM Agent Search in Modular Design Space, 2025. URL https://arxiv.org/abs/2410.06153
- [22] Kimi Team. Kimi K2.5: Visual Agentic Intelligence, 2026. URL https://arxiv.org/abs/2602.02276
- [23] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 847--855, New York, NY, USA, August 2013. Association for Computing Machin...
- [24] Patara Trirat, Wonyong Jeong, and Sung Ju Hwang. AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. doi:10.48550/arXiv.2410.02958. URL http://arxiv.org/abs/2410.02958
- [25] Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026. URL https://arxiv.org/abs/2602.04837
- [26] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023. URL http://arxiv.org/abs/2210.03629
- [27] Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, and Guojie Song. ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024. doi:10.48550/arXiv.2402.01145. URL http://arxiv.org/abs/2402.01145
- [28] Guibin Zhang, Kaijie Chen, Guancheng Wan, Heng Chang, Hong Cheng, Kun Wang, Shuyue Hu, and Lei Bai. EvoFlow: Evolving Diverse Agentic Workflows on the Fly, 2025a. URL https://arxiv.org/abs/2502.07373
- [29]
- [30] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating Agentic Workflow Generation, 2025b. URL https://arxiv.org/abs/2410.10762
- [31] Lei Zhang, Yuge Zhang, Kan Ren, Dongsheng Li, and Yuqing Yang. MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2931--2959, 2024. doi:10.48550/arXiv.2304.14979. URL http://arxiv.org/ab...