AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Pith reviewed 2026-05-16 06:29 UTC · model grok-4.3
The pith
AceGRPO combines an evolving data buffer with learnability-guided sampling to let LLM agents sustain iterative optimization in machine learning engineering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AceGRPO integrates an Evolving Data Buffer, which continuously repurposes execution traces into reusable training tasks, with Adaptive Sampling driven by a Learnability Potential function that dynamically selects tasks at the agent's learning frontier, thereby enabling efficient group relative policy optimization for sustained long-horizon autonomous machine learning engineering.
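The "group relative" part of the claim refers to the base optimizer, GRPO, which scores each rollout against the other rollouts in its sampling group instead of against a learned value model. A minimal sketch of that group-relative advantage step, in the style of the original GRPO formulation rather than the paper's exact implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimate in the style of GRPO:
    each rollout's reward is normalized by the mean and standard
    deviation of its own sampling group, so no learned value
    model (critic) is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts in the group tied: no relative learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four rollouts of one task, rewarded 1.0 for a valid submission, 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Sparse binary rewards like valid/invalid submissions make the zero-variance branch common, which is one reason curriculum mechanisms that keep tasks near a ~50% success rate matter for this optimizer.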
What carries the argument
AceGRPO's dual mechanism of the Evolving Data Buffer for trace repurposing and Learnability Potential-guided Adaptive Sampling for dynamic task prioritization.
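The paper does not publish the Learnability Potential formula, so the sketch below substitutes a common curriculum-learning proxy, p(1 − p) of the smoothed success rate, which peaks at a ~50% success rate, i.e. the learning frontier. The class, method names, and the proxy itself are illustrative assumptions, not the authors' implementation:

```python
import random
from collections import defaultdict

class EvolvingBuffer:
    """Sketch of the dual mechanism: execution traces are repurposed
    into reusable tasks, and sampling is weighted by a hypothetical
    learnability potential p*(1-p) over the estimated success rate."""

    def __init__(self):
        self.tasks = []                            # tasks repurposed from traces
        self.stats = defaultdict(lambda: [0, 0])   # task -> [successes, attempts]

    def add_trace(self, task_id):
        """Repurpose an execution trace into a reusable training task."""
        self.tasks.append(task_id)

    def record(self, task_id, success):
        wins_tries = self.stats[task_id]
        wins_tries[0] += int(success)
        wins_tries[1] += 1

    def potential(self, task_id, prior=0.5, smoothing=1.0):
        """Hypothetical learnability potential: smoothed success-rate
        variance. Unseen tasks default to p=0.5, so they get maximal
        weight and are explored first."""
        wins, tries = self.stats[task_id]
        p = (wins + smoothing * prior) / (tries + smoothing)
        return p * (1.0 - p)

    def sample(self, k=1):
        """Adaptive sampling: draw tasks near the learning frontier."""
        weights = [self.potential(t) for t in self.tasks]
        return random.choices(self.tasks, weights=weights, k=k)
```

Under this proxy, a task the agent always solves or always fails gets near-zero weight, while a task solved about half the time dominates sampling; whether the paper's actual function behaves this way is exactly what the referee report below asks the authors to make verifiable.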
Load-bearing premise
The evolving data buffer and learnability potential function will consistently prevent stagnation and inefficient selection without introducing new bias or prohibitive extra compute.
What would settle it
Observing whether the Ace-30B model maintains a 100% valid submission rate across repeated long-horizon MLE tasks, or instead shows performance collapse from repeated suboptimal data choices, would directly test the claim.
Original abstract
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AceGRPO, an extension of Group Relative Policy Optimization for autonomous machine learning engineering agents. It introduces an Evolving Data Buffer to repurpose execution traces into reusable tasks and Adaptive Sampling driven by a Learnability Potential function to prioritize tasks at the agent's learning frontier. The central claim is that the resulting Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches proprietary frontier model performance, and outperforms larger open-source baselines such as DeepSeek-V3.2.
Significance. If the performance claims are supported by rigorous experiments, the work would be significant for autonomous MLE agents by demonstrating a practical RL curriculum mechanism that mitigates behavioral stagnation over long horizons. The public code release is a clear strength that enables reproducibility and community validation.
major comments (2)
- [Abstract] The Learnability Potential function is described only at a high level, with no explicit formula, equation, or pseudocode; without this definition it is impossible to verify that the function is non-circular and free of selection bias, which undermines the claim that adaptive sampling (rather than the buffer or training length) drives the 100% valid submission rate.
- [Experimental results] No details are supplied on the experimental setup, number of runs, statistical tests, baseline implementations, or ablation studies that disable the Learnability Potential function while keeping the Evolving Data Buffer and GRPO fixed; the headline performance numbers therefore cannot be evaluated.
minor comments (1)
- [Introduction] The distinction between the method name (AceGRPO) and the model name (Ace-30B) is clear in the abstract but could be reinforced in the introduction to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing AceGRPO. The comments correctly identify areas where additional clarity and detail are needed to support the central claims. We address each point below and will revise the manuscript to incorporate the requested information.
Point-by-point responses
-
Referee ([Abstract]): The Learnability Potential function is described only at a high level, with no explicit formula, equation, or pseudocode; without this definition it is impossible to verify that the function is non-circular and free of selection bias, which undermines the claim that adaptive sampling (rather than the buffer or training length) drives the 100% valid submission rate.
Authors: We agree that the abstract provides only a high-level description of the Learnability Potential function. In the revised manuscript we will add its explicit mathematical definition, including the full equation and pseudocode, to the Methods section and reference that definition from the abstract. This will allow readers to verify that the function is non-circular and free of selection bias, strengthening the argument that adaptive sampling (rather than the buffer or training length alone) is responsible for the reported 100% valid submission rate. revision: yes
-
Referee ([Experimental results]): No details are supplied on the experimental setup, number of runs, statistical tests, baseline implementations, or ablation studies that disable the Learnability Potential function while keeping the Evolving Data Buffer and GRPO fixed; the headline performance numbers therefore cannot be evaluated.
Authors: We acknowledge that the current version lacks sufficient detail on the experimental protocol. In the revision we will expand the Experimental Setup and Results sections to report the number of independent runs, the statistical tests performed, the precise baseline implementations, and dedicated ablation studies that disable only the Learnability Potential function while retaining the Evolving Data Buffer and GRPO. These additions will let readers rigorously evaluate the headline performance numbers. revision: yes
Circularity Check
Empirical method with no load-bearing derivations or self-referential reductions
Full rationale
The paper presents AceGRPO as an engineering combination of an Evolving Data Buffer and Adaptive Sampling via a Learnability Potential function. No equations, derivations, or uniqueness theorems appear in the abstract or described text that reduce performance claims (e.g., the 100% valid submission rate) to fitted parameters, self-citations, or inputs by construction. The claims rest on experimental outcomes on MLE-Bench-Lite rather than a closed mathematical chain. This matches the assessed circularity score of 1.0: the central contribution is empirical and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning can be made efficient enough for long-horizon LLM agent training in MLE despite execution latency.
invented entities (1)
- Learnability Potential function (no independent evidence)
Reference graph
Works this paper leans on
- [1] Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095.
- [2] Deng, Y., Hsu, I., Yan, J., Wang, Z., Han, R., Zhang, G., Chen, Y., Wang, W., Pfister, T., Lee, C.-Y., et al. Supervised reinforcement learning: From expert trajectories to step-wise reasoning. arXiv preprint arXiv:2510.25992.
- [3] Du, S., Yan, X., Jiang, D., Yuan, J., Hu, Y., Li, X., He, L., Zhang, B., and Bai, L. AutoMLGen: Navigating fine-grained optimization for coding agents. arXiv preprint arXiv:2510.08511, 2025a. Du, Y., Cai, Y., Zhou, Y., Wang, C., Qian, Y., Pang, X., Liu, Q., Hu, Y., and Chen, S. SWE-Dev: Evaluating and training autonomous feature-driven software deve...
- [4] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [5] Huang, Q., Vora, J., Liang, P., and Leskovec, J. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302.
- [6] Jiang, Z., Schmidt, D., Srikanth, D., Xu, D., Kaplan, I., Jacenko, D., and Wu, Y. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138.
- [7] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- [8] Li, A., Wu, C., Ge, Z., Chong, Y. H., Hou, Z., Cao, L., Ju, C., Wu, J., Li, H., Zhang, H., et al. The FM agent. arXiv preprint arXiv:2510.26144.
- [9] Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025a. Liu, Z., Cai, Y., Zhu, X., Zheng, Y., Chen, R., Wen, Y., Wang, Y., Chen, S., et al. ML-Master: Towards AI-for-AI via integration of exploration an...
- [10] Luo, M., Jain, N., Singh, J., Tan, S., Patel, A., Wu, Q., Ariyak, A., Cai, C., Venkat, T., Zhu, S., Athiwaratkun, B., Roongta, M., Zhang, C., Li, L. E., Popa, R. A., Sen, K., and Stoica, I. DeepSWE: Training a state-of-the-art coding agent from scratch by scaling RL. https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the...
- [11] Qiang, R., Zhuang, Y., Li, Y., Zhang, R., Li, C., Wong, I. S.-H., Yang, S., Liang, P., Zhang, C., Dai, B., et al. MLE-Dojo: Interactive environments for empowering LLM agents in machine learning engineering. arXiv preprint arXiv:2505.07782.
- [12] Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., and Shakkottai, S. Reinforcement learning with sparse rewards using guidance from offline demonstration. arXiv preprint arXiv:2202.04628.
- [13] Sahney, A., Gorthi, R., Łastowski, C., and Vega, J. Operand Quant: A single-agent architecture for autonomous machine learning engineering. arXiv preprint arXiv:2510.11694.
- [14] Tang, X., Xu, W., Wang, Y., Guo, Z., Shao, D., Chen, J., Zhang, C., Wang, Z., Zhang, L., Wan, G., et al. Eigen-1: Adaptive multi-agent refinement with monitor-based RAG for scientific reasoning. arXiv preprint arXiv:2509.21193.
- [15]
- [16] Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741.
- [17] Wei, Y., Duchenne, O., Copet, J., Carbonneaux, Q., Zhang, L., Fried, D., Synnaeve, G., Singh, R., and Wang, S. I. SWE-RL: Advancing LLM reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449.
- [18] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Yang, X., Yang, X., Fang, S., Xian, B., Li, Y., Wang, J., Xu, M., Pan, H., Hong, X., Liu, W., et al. R&D-Agent: Automating data-driven AI solution building through LLM-powered automated rese...
- [19] Zhang, Y., Wu, S., Yang, Y., Shu, J., Xiao, J., Kong, C., and Sang, J. o1-Coder: An o1 replication for coding. arXiv preprint arXiv:2412.00154.
- [20] Zhao, A., Wu, Y., Yue, Y., Wu, T., Xu, Q., Lin, M., Wang, S., Wu, Q., Zheng, Z., and Huang, G. Absolute Zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335.
- [21] Zhu, X., Cai, Y., Liu, Z., Zheng, B., Wang, C., Ye, R., Chen, J., Wang, H., Wang, W.-C., Zhang, Y., et al. Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering. arXiv preprint arXiv:2601.10402.