MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Boyuan Sun; Bo Zhang; Jie Zhou; Jinxin Shi; Lei Bai; Liang He; Shangheng Du; Shiyang Feng; Tianshuo Peng; Xiangchao Yan

arxiv: 2606.06473 · v1 · pith:MFIGT5G3new · submitted 2026-06-04 · 💻 cs.AI · cs.CL


Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai


Pith reviewed 2026-06-28 00:54 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords automated machine learning · LLM agents · algorithm discovery · self-evolving frameworks · multi-agent systems · MLE-Bench · tree search · memory retrieval


The pith

MLEvolve lets LLM agents discover machine learning algorithms by sharing information across search branches and reusing past experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM agents for machine learning engineering tasks suffer from isolated search branches that cannot share findings, searches that forget prior attempts, and weak separation between high-level planning and low-level code writing. MLEvolve introduces three targeted fixes: Progressive MCGS extends ordinary tree search with graph edges that link branches and gradually narrows focus using an entropy schedule; Retrospective Memory stores both fixed domain knowledge and dynamic task experience for later retrieval; adaptive coding modes keep strategic decisions separate from code generation. These changes produce higher medal rates and valid submission rates on MLE-Bench even when the time budget is cut in half, and they also beat specialized methods on mathematical algorithm tasks. A reader would care because the work shows one concrete route toward agents that can keep improving on long engineering problems without repeated human resets.

Core claim

MLEvolve is an LLM-based self-evolving multi-agent framework that extends tree search to Progressive MCGS for cross-branch information flow, adds Retrospective Memory for experience retrieval and reuse, and decouples strategic planning from code generation via adaptive coding modes, yielding state-of-the-art average medal rate and valid submission rate on MLE-Bench under a 12-hour budget while also outperforming AlphaEvolve on mathematical algorithm optimization.

What carries the argument

Progressive MCGS, which augments tree search with graph-based reference edges for cross-branch flow and applies an entropy-inspired progressive schedule to move from broad exploration to focused exploitation.
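To make the mechanism concrete, here is a minimal sketch, not the authors' implementation. The abstract gives neither the schedule's formula nor the rule for adding reference edges, so everything below is an assumption made for illustration: the `Node` class, the linear decay in `temperature`, the softmax in `selection_probs`, and the helper `add_reference` are all hypothetical names.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so cyclic references are safe
class Node:
    """One candidate solution in the search graph (hypothetical structure)."""
    name: str
    score: float
    children: list = field(default_factory=list)    # ordinary tree edges
    references: list = field(default_factory=list)  # cross-branch reference edges


def add_reference(src: Node, dst: Node) -> None:
    """Let `src` consult `dst`, even when the two sit in different branches."""
    if dst is not src and dst not in src.references:
        src.references.append(dst)


def temperature(step: int, total_steps: int,
                t_start: float = 1.0, t_end: float = 0.05) -> float:
    """Assumed schedule: linear decay from t_start to t_end over the run."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + (t_end - t_start) * frac


def selection_probs(nodes, temp):
    """Softmax over node scores; a lower temperature concentrates the mass."""
    top = max(n.score for n in nodes)
    weights = [math.exp((n.score - top) / temp) for n in nodes]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy (in nats) of the selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select(nodes, step, total_steps, rng=random):
    """Sample the next node to expand under the current temperature."""
    probs = selection_probs(nodes, temperature(step, total_steps))
    return rng.choices(nodes, weights=probs, k=1)[0]
```

In this sketch the entropy of the selection distribution falls as the run progresses, which is one simple way to read "broad exploration to focused exploitation". The paper's actual schedule may differ.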

If this is right

Higher average medal rate and valid submission rate on MLE-Bench when restricted to a 12-hour budget.
Better performance than specialized algorithm discovery methods such as AlphaEvolve on mathematical optimization tasks.
Demonstrated cross-domain generalization from machine learning engineering to mathematical algorithm discovery.
Sustained self-evolution over long-horizon tasks through accumulated experience reuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cross-branch and memory mechanisms could be tested on other long-horizon LLM tasks such as automated scientific experiment design.
If Retrospective Memory continues to scale without degradation, the framework may support multi-week iterative discovery runs without external resets.
Disabling the progressive entropy schedule while keeping the graph edges would isolate whether the exploration-to-exploitation shift is necessary for the reported gains.

Load-bearing premise

The measured performance gains come from the cross-branch edges, retrospective memory retrieval, and adaptive coding modes rather than from other aspects of the implementation.

What would settle it

An ablation that removes the cross-branch reference edges or the dynamic memory component and measures no reduction in medal rate or valid submission rate on MLE-Bench.
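One way such an ablation could be laid out is a leave-one-out grid over the three named components. The sketch below is illustrative only: the `AblationConfig` field names and the helper functions are hypothetical, and no agent runner is included because the paper's interface is not described here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AblationConfig:
    """Toggles for the three components named in the paper (hypothetical names)."""
    cross_branch_edges: bool = True
    retrospective_memory: bool = True
    adaptive_coding_modes: bool = True


def leave_one_out(full: AblationConfig = AblationConfig()):
    """Return the full system plus one variant per disabled component."""
    variants = {"full": full}
    for name in ("cross_branch_edges", "retrospective_memory", "adaptive_coding_modes"):
        variants[f"no_{name}"] = replace(full, **{name: False})
    return variants


def medal_rate_drop(full_rate: float, ablated_rate: float) -> float:
    """Absolute drop in medal rate when one component is removed."""
    return full_rate - ablated_rate
```

For the comparison to be meaningful, every variant would need the same LLM backbone, token budget, and time limit, so that the toggled component is the only difference.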

Original abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.
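The two-part structure of Retrospective Memory can be illustrated with a toy sketch. It is not the released implementation: the class and method names are hypothetical, and the bag-of-words cosine similarity is a stand-in chosen only to keep the example self-contained, since the abstract does not say how retrieval is done.

```python
import math
from collections import Counter


def _vec(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for whatever embedding is really used)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class RetrospectiveMemory:
    """Cold-start knowledge base plus a dynamic store that grows during the run."""

    def __init__(self, domain_knowledge):
        self.static = list(domain_knowledge)  # fixed before the search starts
        self.dynamic = []                     # task experience added over time

    def record(self, note: str) -> None:
        """Store one piece of experience from the current task."""
        self.dynamic.append(note)

    def retrieve(self, query: str, k: int = 2):
        """Return the k entries most similar to the query, from both stores."""
        q = _vec(query)
        pool = self.static + self.dynamic
        return sorted(pool, key=lambda e: _cosine(q, _vec(e)), reverse=True)[:k]
```

A call such as `mem.retrieve("overfit tabular", k=1)` would surface an earlier note about overfitting on tabular data, which is the kind of reuse the abstract describes.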

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MLEvolve names three mechanisms for LLM agent search, but the results compare only the full system to baselines, with no ablations.


The main thing to know about this paper is that it proposes a self-evolving multi-agent system called MLEvolve for discovering machine learning algorithms, but the reported improvements are not tied to the three new components through any controlled experiments.

What is actually new here is threefold. Progressive MCGS extends tree search with cross-branch reference edges and uses an entropy-inspired schedule to move from broad to focused search. Retrospective Memory starts from domain knowledge and adds dynamic, task-specific retrieval. Finally, the adaptive coding modes decouple high-level planning from the actual code writing. These target the problems of isolated branches, no memory accumulation, and unstable long runs that the authors see in earlier LLM agents for MLE.

The paper does a solid job making those problems concrete and showing that the full system reaches higher average medal rates and valid submission rates on MLE-Bench even with only 12 hours instead of the usual 24. It also beats specialized methods like AlphaEvolve when applied to mathematical algorithm optimization, which suggests some generalization. The code release on GitHub is a practical plus.

The soft spot is the evaluation design. Only the complete MLEvolve is tested against outside baselines. There are no results that remove the cross-branch edges, disable the memory retrieval, or fix the coding mode while keeping the rest the same. Without those, the performance lift could come from the choice of base model, the total token count, or other unmentioned details rather than the listed mechanisms. The abstract states the headline claims but gives no supporting data on run counts, variance, or statistical significance.

This work is aimed at researchers who build LLM agents for automated machine learning and scientific discovery tasks. Someone already following that literature might pick up useful implementation ideas from the described components and the public code.

I would recommend sending the paper to peer review. The framework is detailed enough and the code release helps, so referees could ask for the needed ablations and more rigorous reporting to strengthen the claims.

Referee Report

1 major / 1 minor

Summary. The paper introduces MLEvolve, an LLM-based multi-agent framework for end-to-end machine learning algorithm discovery. It extends tree search via Progressive MCGS to enable cross-branch information flow through graph edges and an entropy-inspired progressive schedule, introduces Retrospective Memory combining a cold-start knowledge base with dynamic global memory for experience retrieval, and decouples strategic planning from code generation using adaptive coding modes. The central empirical claim is that the full system achieves state-of-the-art average medal rate and valid submission rate on MLE-Bench under a 12-hour budget (half the standard runtime) and outperforms specialized methods including AlphaEvolve on mathematical algorithm optimization tasks.

Significance. If the performance claims hold and the gains can be attributed to the three proposed mechanisms, the work would constitute a meaningful step forward in self-evolving LLM agents for long-horizon MLE tasks by addressing inter-branch isolation and memoryless search. The public release of code at the cited GitHub repository is a clear strength that supports reproducibility and follow-on work.

major comments (1)

[Evaluation] Evaluation section: the central claim attributes the reported SOTA medal rate, valid submission rate, and outperformance of AlphaEvolve specifically to Progressive MCGS cross-branch edges, Retrospective Memory, and adaptive coding modes. However, only the full system versus external baselines is reported; no ablation that removes or disables any one of these three components while keeping the remainder fixed is described. Without such controls it is not possible to rule out that the observed gains arise from the LLM backbone, total token budget, or prompt engineering rather than the cited algorithmic additions.

minor comments (1)

[Abstract] Abstract: performance numbers are stated without any accompanying statistical details, run counts, variance, or baseline definitions, which reduces the ability of readers to assess the claims at a glance.
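As an illustration of the kind of reporting this comment asks for, the sketch below summarizes a medal rate over repeated runs with a mean, a sample standard deviation, and a percentile bootstrap interval. The function names are hypothetical, any inputs passed to them here are made up, and the bootstrap is just one reasonable choice, not something the paper specifies.

```python
import random
import statistics


def medal_rate(medals):
    """Fraction of tasks that earned any medal; `medals` is a list of booleans."""
    return sum(medals) / len(medals)


def summarize(per_run_rates, n_boot=2000, seed=0):
    """Mean, sample std, and a percentile bootstrap 95% interval over runs."""
    rng = random.Random(seed)
    n = len(per_run_rates)
    # Resample runs with replacement and collect the mean of each resample.
    boots = sorted(
        statistics.fmean(rng.choices(per_run_rates, k=n)) for _ in range(n_boot)
    )
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot) - 1]
    return {
        "runs": n,
        "mean": statistics.fmean(per_run_rates),
        "std": statistics.stdev(per_run_rates) if n > 1 else 0.0,
        "ci95": (lo, hi),
    }
```

With only a handful of runs the interval will be wide, which is itself useful information for a reader weighing a state-of-the-art claim.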

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments. We address the concern about the lack of ablation studies below and will revise the manuscript accordingly.


Referee: [Evaluation] Evaluation section: the central claim attributes the reported SOTA medal rate, valid submission rate, and outperformance of AlphaEvolve specifically to Progressive MCGS cross-branch edges, Retrospective Memory, and adaptive coding modes. However, only the full system versus external baselines is reported; no ablation that removes or disables any one of these three components while keeping the remainder fixed is described. Without such controls it is not possible to rule out that the observed gains arise from the LLM backbone, total token budget, or prompt engineering rather than the cited algorithmic additions.

Authors: We agree that the current evaluation reports only full-system results against external baselines and does not include internal ablations that isolate Progressive MCGS, Retrospective Memory, or adaptive coding modes. This limits the strength of causal attribution. In the revised manuscript we will add ablation experiments that disable each component individually while holding the LLM backbone, token budget, and other elements fixed, reporting the resulting changes in medal rate and submission rate on MLE-Bench.

revision: yes

Circularity Check

0 steps flagged

No circularity; framework is descriptive with no derivation chain


The manuscript is a systems description of an LLM agent framework. It contains no equations, no fitted parameters, no 'predictions' of quantities derived from inputs, and no uniqueness theorems that lean on self-citation. All performance claims are empirical comparisons on MLE-Bench; the three named mechanisms are presented as design choices whose contribution is asserted by the full-system results rather than by any algebraic reduction to the inputs. This matches the default case of a self-contained empirical paper with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or background assumptions; therefore the ledger is empty.

pith-pipeline@v0.9.1-grok · 5820 in / 1168 out tokens · 31885 ms · 2026-06-28T00:54:17.287666+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agents-K1: Towards Agent-native Knowledge Orchestration
cs.AI 2026-06 unverdicted novelty 5.0

Agents-K1 builds agent-native scientific knowledge graphs from full papers via a multimodal parser, 4B GRPO-trained extractor, and tri-source graph interface, applied to 2.46M papers yielding Scholar-KG.

Reference graph

Works this paper leans on

38 extracted references · 8 linked inside Pith · cited by 1 Pith paper

[1]

AI and science: what 1,600 researchers think

Richard Van Noorden and Jeffrey M Perkel. “AI and science: what 1,600 researchers think”. In: Nature 621.7980 (2023), pp. 672–675

2023
[2]

A survey on the optimization of large language model-based agents

Shangheng Du et al. “A survey on the optimization of large language model-based agents”. In: ACM Computing Surveys 58.9 (2026), pp. 1–37

2026
[3]

Towards end-to-end automation of AI research

Chris Lu et al. “Towards end-to-end automation of AI research”. In: Nature 651.8107 (2026), pp. 914–919

2026
[4]

NovelSeek: When Agent Becomes the Scientist–Building Closed-Loop System from Hypothesis to Verification

NovelSeek Team et al. “NovelSeek: When Agent Becomes the Scientist–Building Closed-Loop System from Hypothesis to Verification”. In: arXiv preprint arXiv:2505.16938 (2025)

arXiv 2025
[5]

Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery

Shiyang Feng et al. “Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery”. In: arXiv preprint arXiv:2602.08990 (2026)

arXiv 2026
[6]

Alphaevolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov et al. “Alphaevolve: A coding agent for scientific and algorithmic discovery”. In: arXiv preprint arXiv:2506.13131 (2025)

Pith/arXiv arXiv 2025
[7]

Software engineering for machine learning: A case study

Saleema Amershi et al. “Software engineering for machine learning: A case study”. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE. 2019, pp. 291–300

2019
[8]

AutoML: A survey of the state-of-the-art

Xin He, Kaiyong Zhao, and Xiaowen Chu. “AutoML: A survey of the state-of-the-art”. In: Knowledge-based systems 212 (2021), p. 106622

2021
[9]

Auto-sklearn 2.0: Hands-free automl via meta-learning

Matthias Feurer et al. “Auto-sklearn 2.0: Hands-free automl via meta-learning”. In: Journal of Machine Learning Research 23.261 (2022), pp. 1–61

2022
[10]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang et al. “Openhands: An open platform for ai software developers as generalist agents”. In: International Conference on Learning Representations. Vol. 2025. 2025, pp. 65882–65919

2025
[11]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang et al. “Mlagentbench: Evaluating language agents on machine learning experimentation”. In: arXiv preprint arXiv:2310.03302 (2023)

arXiv 2023
[12]

Aide: Ai-driven exploration in the space of code

Zhengyao Jiang et al. “Aide: Ai-driven exploration in the space of code”. In: arXiv preprint arXiv:2502.13138 (2025)

Pith/arXiv arXiv 2025
[13]

R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science

Xu Yang et al. “R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science”. In: arXiv preprint arXiv:2505.14738 (2025)

arXiv 2025
[14]

ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning

Zexi Liu et al. “ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning”. In: arXiv preprint arXiv:2506.16499 (2025)

arXiv 2025
[15]

The fm agent

Annan Li et al. “The fm agent”. In: arXiv preprint arXiv:2510.26144 (2025)

arXiv 2025
[16]

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Edan Toledo et al. “AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench”. In: arXiv preprint arXiv:2507.02554 (2025)

arXiv 2025
[17]

AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Yixin Ou et al. “AutoMind: Adaptive Knowledgeable Agent for Automated Data Science”. In: arXiv preprint arXiv:2506.10974 (2025)

arXiv 2025
[18]

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen et al. “MARS: Modular Agent with Reflective Search for Automated AI Research”. In: arXiv preprint arXiv:2602.02660 (2026)

Pith/arXiv arXiv 2026
[19]

Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering

Xinyu Zhu et al. “Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering”. In: arXiv preprint arXiv:2601.10402 (2026)

arXiv 2026
[20]

AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents

Shangheng Du et al. “AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents”. In: arXiv preprint arXiv:2510.08511 (2025)

arXiv 2025
[21]

Mathematical exploration and discovery at scale

Bogdan Georgiev et al. “Mathematical exploration and discovery at scale”. In: arXiv preprint arXiv:2511.02864 (2025)

Pith/arXiv arXiv 2025
[22]

MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

Jaehyun Nam et al. “MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement”. In: arXiv preprint arXiv:2506.15692 (2025)

arXiv 2025
[23]

Mlzero: A multi-agent system for end-to-end machine learning automation

Haoyang Fang et al. “Mlzero: A multi-agent system for end-to-end machine learning automation”. In: arXiv preprint arXiv:2505.13941 (2025)

arXiv 2025
[24]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan et al. “MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering”. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. 2025

2025
[25]

AIBuildAI: An AI Agent for Automatically Building AI Models

Ruiyi Zhang et al. “AIBuildAI: An AI Agent for Automatically Building AI Models”. In: arXiv preprint arXiv:2604.14455 (2026)

Pith/arXiv arXiv 2026
[26]

KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization

Alireza Nadafian, Alireza Mohammadshahi, and Majid Yazdani. “KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization”. In: arXiv preprint arXiv:2601.21526 (2026)

arXiv 2026
[27]

Monte-Carlo graph search for AlphaZero

Johannes Czech, Patrick Korus, and Kristian Kersting. “Monte-Carlo graph search for AlphaZero”. In: arXiv preprint arXiv:2012.11045 (2020)

arXiv 2020
[28]

Monte-carlo graph search: the value of merging similar states

Edouard Leurent and Odalric-Ambrym Maillard. “Monte-carlo graph search: the value of merging similar states”. In: Asian Conference on Machine Learning. PMLR. 2020, pp. 577–592

2020
[29]

Locagent: Graph-guided llm agents for code localization

Zhaoling Chen et al. “Locagent: Graph-guided llm agents for code localization”. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025, pp. 8697–8727

2025
[30]

Codexgraph: Bridging large language models and code repositories via code graph databases

Xiangyan Liu et al. “Codexgraph: Bridging large language models and code repositories via code graph databases”. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025, pp. 142–160

2025
[31]

A survey on the memory mechanism of large language model-based agents

Zeyu Zhang et al. “A survey on the memory mechanism of large language model-based agents”. In: ACM Transactions on Information Systems 43.6 (2025), pp. 1–47

2025
[32]

A-mem: Agentic memory for llm agents

Wujiang Xu et al. “A-mem: Agentic memory for llm agents”. In: Advances in Neural Information Processing Systems 38 (2026), pp. 17577–17604

2026
[33]

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Yifei Zhang et al. “Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search”. In: arXiv preprint arXiv:2603.01692 (2026)

Pith/arXiv arXiv 2026
[34]

Information theory and statistical mechanics

Edwin T Jaynes. “Information theory and statistical mechanics”. In: Physical review 106.4 (1957), p. 620

1957
[35]

Billion-scale similarity search with GPUs

Jeff Johnson, Matthijs Douze, and Hervé Jégou. “Billion-scale similarity search with GPUs”. In: IEEE transactions on big data 7.3 (2019), pp. 535–547

2019
[36]

Evaluation-driven Scaling for Scientific Discovery

Haotian Ye et al. “Evaluation-driven Scaling for Scientific Discovery”. In: arXiv preprint arXiv:2604.19341 (2026)

Pith/arXiv arXiv 2026
[37]

Learning to discover at test time

Mert Yuksekgonul et al. “Learning to discover at test time”. In: arXiv preprint arXiv:2601.16175 (2026)

Pith/arXiv arXiv 2026
[38]

OpenEvolve: an open-source evolutionary coding agent

Asankhaya Sharma. OpenEvolve: an open-source evolutionary coding agent. 2025. URL: https://github.com/algorithmicsuperintelligence/openevolve

arXiv 2025
