MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery
Pith reviewed 2026-06-28 00:54 UTC · model grok-4.3
The pith
MLEvolve lets LLM agents discover machine learning algorithms by sharing information across search branches and reusing past experience.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLEvolve is an LLM-based, self-evolving multi-agent framework built on three mechanisms: Progressive MCGS, which extends tree search with cross-branch information flow; Retrospective Memory, which retrieves and reuses past experience; and adaptive coding modes, which decouple strategic planning from code generation. The paper reports state-of-the-art average medal rate and valid submission rate on MLE-Bench under a 12-hour budget, and better results than AlphaEvolve on mathematical algorithm optimization.
What carries the argument
Progressive MCGS, which augments tree search with graph-based reference edges for cross-branch flow and applies an entropy-inspired progressive schedule to move from broad exploration to focused exploitation.
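The two ingredients named above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Node` structure, the geometric temperature decay, the softmax selection rule, and the rule for choosing reference nodes are all assumptions made for illustration. The sketch only shows how an annealed selection distribution loses entropy over time (exploration giving way to exploitation) and how a node might point at strong nodes outside its own lineage.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate solution in the search graph (hypothetical structure)."""
    node_id: int
    score: float                                          # validation score, higher is better
    parent: Optional[int] = None                          # ordinary tree edge
    references: List[int] = field(default_factory=list)   # cross-branch reference edges


def temperature(step: int, total_steps: int,
                t_start: float = 1.0, t_end: float = 0.05) -> float:
    """Geometric decay from t_start to t_end (an assumed schedule, not the paper's)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


def selection_probs(nodes: List[Node], temp: float) -> List[float]:
    """Softmax over scores; a lower temperature concentrates mass on the best nodes."""
    top = max(n.score for n in nodes)
    weights = [math.exp((n.score - top) / temp) for n in nodes]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of the selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_references(nodes: List[Node], target: Node, k: int = 2) -> List[int]:
    """Ids of the k best-scoring nodes that are not on the target's path to the root."""
    by_id = {n.node_id: n for n in nodes}
    lineage = set()
    cur: Optional[Node] = target
    while cur is not None:
        lineage.add(cur.node_id)
        cur = by_id.get(cur.parent) if cur.parent is not None else None
    others = sorted((n for n in nodes if n.node_id not in lineage),
                    key=lambda n: n.score, reverse=True)
    return [n.node_id for n in others[:k]]
```

Comparing `entropy(selection_probs(nodes, temperature(step, total)))` at an early and a late step shows the shift from broad to focused selection under these assumptions. Whether the paper's schedule behaves the same way is exactly what the suggested ablation would test.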
If this is right
- Higher average medal rate and valid submission rate on MLE-Bench when restricted to a 12-hour budget.
- Better performance than specialized algorithm discovery methods such as AlphaEvolve on mathematical optimization tasks.
- Demonstrated cross-domain generalization from machine learning engineering to mathematical algorithm discovery.
- Sustained self-evolution over long-horizon tasks through accumulated experience reuse.
Where Pith is reading between the lines
- The same cross-branch and memory mechanisms could be tested on other long-horizon LLM tasks such as automated scientific experiment design.
- If Retrospective Memory continues to scale without degradation, the framework may support multi-week iterative discovery runs without external resets.
- Disabling the progressive entropy schedule while keeping the graph edges would isolate whether the exploration-to-exploitation shift is necessary for the reported gains.
Load-bearing premise
The measured performance gains come from the cross-branch edges, retrospective memory retrieval, and adaptive coding modes rather than from other aspects of the implementation.
What would settle it
An ablation that removes the cross-branch reference edges or the dynamic memory component and measures no reduction in medal rate or valid submission rate on MLE-Bench.
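A leave-one-out design of this kind can be written down as a small configuration grid. The sketch below is hypothetical: the flag names are invented for illustration and do not correspond to options in the MLEvolve repository. It enumerates the full system plus one variant per component with only that component disabled, with everything else (LLM backbone, token budget, prompts) assumed to be held fixed by the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical on/off switches for the mechanisms discussed in this review."""
    reference_edges: bool
    progressive_schedule: bool
    retrospective_memory: bool
    adaptive_coding: bool

    @property
    def name(self) -> str:
        # Label a run by the components it disables, e.g. "no_reference_edges".
        off = [key for key, enabled in vars(self).items() if not enabled]
        return "full" if not off else "+".join("no_" + key for key in off)


def leave_one_out() -> List[AblationConfig]:
    """Full system first, then one config per component with only that one disabled."""
    fields = list(AblationConfig.__dataclass_fields__)
    configs = [AblationConfig(**{f: True for f in fields})]
    for disabled in fields:
        configs.append(AblationConfig(**{f: f != disabled for f in fields}))
    return configs
```

Running each configuration on the same MLE-Bench tasks with the same seeds and comparing medal rate and valid submission rate against `full` would show which components the reported gains depend on.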
Original abstract
Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.
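The abstract describes Retrospective Memory only at a high level. As a rough sketch of what "a cold-start knowledge base plus a dynamic global memory" could look like, the example below stores both kinds of entries in one list and retrieves by similarity to a query. Everything here is an assumption: the class name, the bag-of-words similarity (a real system would more plausibly use a learned text encoder), and the ranking rule.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector, standing in for a learned text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class RetrospectiveMemorySketch:
    """Hypothetical two-source memory: static notes plus experience added during a run."""

    def __init__(self, cold_start_notes: List[str]):
        # Cold-start domain knowledge, available before any search step runs.
        self.entries = [("knowledge_base", note, embed(note)) for note in cold_start_notes]

    def record(self, experience: str) -> None:
        # Dynamic global memory: lessons appended while the task is being solved.
        self.entries.append(("global_memory", experience, embed(experience)))

    def retrieve(self, query: str, k: int = 3) -> List[Tuple[str, str]]:
        # Rank all entries, regardless of source, by similarity to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(source, text) for source, text, _ in ranked[:k]]
```

An agent built this way would call `record` after each evaluated candidate and `retrieve` before planning the next one, so later steps can draw on both prior knowledge and what the current run has already learned.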
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MLEvolve, an LLM-based multi-agent framework for end-to-end machine learning algorithm discovery. It extends tree search via Progressive MCGS to enable cross-branch information flow through graph edges and an entropy-inspired progressive schedule, introduces Retrospective Memory combining a cold-start knowledge base with dynamic global memory for experience retrieval, and decouples strategic planning from code generation using adaptive coding modes. The central empirical claim is that the full system achieves state-of-the-art average medal rate and valid submission rate on MLE-Bench under a 12-hour budget (half the standard runtime) and outperforms specialized methods including AlphaEvolve on mathematical algorithm optimization tasks.
Significance. If the performance claims hold and the gains can be attributed to the three proposed mechanisms, the work would constitute a meaningful step forward in self-evolving LLM agents for long-horizon MLE tasks by addressing inter-branch isolation and memoryless search. The public release of code at the cited GitHub repository is a clear strength that supports reproducibility and follow-on work.
major comments (1)
- [Evaluation] Evaluation section: the central claim attributes the reported SOTA medal rate, valid submission rate, and outperformance of AlphaEvolve specifically to Progressive MCGS cross-branch edges, Retrospective Memory, and adaptive coding modes. However, only the full system versus external baselines is reported; no ablation that removes or disables any one of these three components while keeping the remainder fixed is described. Without such controls it is not possible to rule out that the observed gains arise from the LLM backbone, total token budget, or prompt engineering rather than the cited algorithmic additions.
minor comments (1)
- [Abstract] Abstract: performance numbers are stated without any accompanying statistical details, run counts, variance, or baseline definitions, which reduces the ability of readers to assess the claims at a glance.
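One way to supply the missing statistical detail is to report each rate with an interval computed over per-run outcomes. The helper below is a generic sketch, not something taken from the paper or from MLE-Bench tooling: it assumes one binary outcome per (task, seed) run, for example 1 if the run earned a medal, and uses a percentile bootstrap.

```python
import random
from typing import Sequence, Tuple


def rate_with_ci(outcomes: Sequence[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Mean of 0/1 outcomes with a percentile bootstrap confidence interval."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)          # fixed seed so the interval is reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample runs with replacement and recompute the rate each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return point, lower, upper
```

The same function applies to valid submission rate by passing 1 for each run that produced a valid submission. Reporting the number of runs alongside the interval would address the run-count part of the comment.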
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the concern about the lack of ablation studies below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: the central claim attributes the reported SOTA medal rate, valid submission rate, and outperformance of AlphaEvolve specifically to Progressive MCGS cross-branch edges, Retrospective Memory, and adaptive coding modes. However, only the full system versus external baselines is reported; no ablation that removes or disables any one of these three components while keeping the remainder fixed is described. Without such controls it is not possible to rule out that the observed gains arise from the LLM backbone, total token budget, or prompt engineering rather than the cited algorithmic additions.
Authors: We agree that the current evaluation reports only full-system results against external baselines and does not include internal ablations that isolate Progressive MCGS, Retrospective Memory, or adaptive coding modes. This limits the strength of causal attribution. In the revised manuscript we will add ablation experiments that disable each component individually while holding the LLM backbone, token budget, and other elements fixed, reporting the resulting changes in medal rate and submission rate on MLE-Bench.
Revision: yes
Circularity Check
No circularity; framework is descriptive with no derivation chain
Full rationale
The manuscript is a systems description of an LLM agent framework. It contains no equations, no fitted parameters, no "predictions" of quantities derived from inputs, and no uniqueness theorems that rest on self-citation. All performance claims are empirical comparisons on MLE-Bench; the three named mechanisms are presented as design choices whose contribution is asserted by the full-system results rather than by any algebraic reduction to the inputs. This matches the default case of a self-contained empirical paper, which receives a circularity score of 0.
Forward citations
Cited by 1 Pith paper
- Agents-K1: Towards Agent-native Knowledge Orchestration
  Agents-K1 builds agent-native scientific knowledge graphs from full papers via a multimodal parser, a 4B GRPO-trained extractor, and a tri-source graph interface, applied to 2.46M papers to yield Scholar-KG.
Appendix A. Agent Descriptions
MLEvolve is realized through a team of specialized agents, each tailored to a specific search phase or operator type. We summarize their roles:
- Draft Agent. Generates initial candi...