AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Pith reviewed 2026-05-16 06:29 UTC · model grok-4.3
The pith
AceGRPO combines an evolving data buffer with learnability-guided sampling to let LLM agents sustain iterative optimization in machine learning engineering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AceGRPO integrates an Evolving Data Buffer, which continuously repurposes execution traces into reusable training tasks, with Adaptive Sampling driven by a Learnability Potential function that dynamically selects tasks at the agent's learning frontier, thereby enabling efficient group relative policy optimization for sustained long-horizon autonomous machine learning engineering.
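The "group relative" part of the claim refers to the base optimizer, GRPO, which scores each rollout against the other rollouts in its sampling group instead of against a learned value model. A minimal sketch of that group-relative advantage step, in the style of the original GRPO formulation rather than the paper's exact implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimate in the style of GRPO:
    each rollout's reward is normalized by the mean and standard
    deviation of its own sampling group, so no learned value
    model (critic) is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts in the group tied: no relative learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four rollouts of one task, rewarded 1.0 for a valid submission, 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Sparse binary rewards like valid/invalid submissions make the zero-variance branch common, which is one reason curriculum mechanisms that keep tasks near a ~50% success rate matter for this optimizer.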
What carries the argument
AceGRPO's dual mechanism of the Evolving Data Buffer for trace repurposing and Learnability Potential-guided Adaptive Sampling for dynamic task prioritization.
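The paper does not publish the Learnability Potential formula, so the sketch below substitutes a common curriculum-learning proxy, p(1 − p) of the smoothed success rate, which peaks at a ~50% success rate, i.e. the learning frontier. The class, method names, and the proxy itself are illustrative assumptions, not the authors' implementation:

```python
import random
from collections import defaultdict

class EvolvingBuffer:
    """Sketch of the dual mechanism: execution traces are repurposed
    into reusable tasks, and sampling is weighted by a hypothetical
    learnability potential p*(1-p) over the estimated success rate."""

    def __init__(self):
        self.tasks = []                            # tasks repurposed from traces
        self.stats = defaultdict(lambda: [0, 0])   # task -> [successes, attempts]

    def add_trace(self, task_id):
        """Repurpose an execution trace into a reusable training task."""
        self.tasks.append(task_id)

    def record(self, task_id, success):
        wins_tries = self.stats[task_id]
        wins_tries[0] += int(success)
        wins_tries[1] += 1

    def potential(self, task_id, prior=0.5, smoothing=1.0):
        """Hypothetical learnability potential: smoothed success-rate
        variance. Unseen tasks default to p=0.5, so they get maximal
        weight and are explored first."""
        wins, tries = self.stats[task_id]
        p = (wins + smoothing * prior) / (tries + smoothing)
        return p * (1.0 - p)

    def sample(self, k=1):
        """Adaptive sampling: draw tasks near the learning frontier."""
        weights = [self.potential(t) for t in self.tasks]
        return random.choices(self.tasks, weights=weights, k=k)
```

Under this proxy, a task the agent always solves or always fails gets near-zero weight, while a task solved about half the time dominates sampling; whether the paper's actual function behaves this way is exactly what the referee report below asks the authors to make verifiable.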
Load-bearing premise
The evolving data buffer and learnability potential function will consistently prevent stagnation and inefficient selection without introducing new bias or prohibitive extra compute.
What would settle it
Observing whether the Ace-30B model maintains a 100% valid submission rate across repeated long-horizon MLE tasks, or instead shows performance collapse from repeated suboptimal data choices, would directly test the claim.
Original abstract
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AceGRPO, an extension of Group Relative Policy Optimization for autonomous machine learning engineering agents. It introduces an Evolving Data Buffer to repurpose execution traces into reusable tasks and Adaptive Sampling driven by a Learnability Potential function to prioritize tasks at the agent's learning frontier. The central claim is that the resulting Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches proprietary frontier model performance, and outperforms larger open-source baselines such as DeepSeek-V3.2.
Significance. If the performance claims are supported by rigorous experiments, the work would be significant for autonomous MLE agents by demonstrating a practical RL curriculum mechanism that mitigates behavioral stagnation over long horizons. The public code release is a clear strength that enables reproducibility and community validation.
major comments (2)
- [Abstract] The Learnability Potential function is described only at a high level, with no explicit formula, equation, or pseudocode; without this definition it is impossible to verify that the function is non-circular and free of selection bias, which undermines the claim that adaptive sampling (rather than the buffer or training length) drives the 100% valid submission rate.
- [Experimental results] No details are supplied on the experimental setup, number of runs, statistical tests, baseline implementations, or ablation studies that disable the Learnability Potential function while keeping the Evolving Data Buffer and GRPO fixed; the headline performance numbers therefore cannot be evaluated.
minor comments (1)
- [Introduction] The distinction between the method name (AceGRPO) and the model name (Ace-30B) is clear in the abstract but could be reinforced in the introduction to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing AceGRPO. The comments correctly identify areas where additional clarity and detail are needed to support the central claims. We address each point below and will revise the manuscript to incorporate the requested information.
Point-by-point responses
-
Referee ([Abstract]): The Learnability Potential function is described only at a high level, with no explicit formula, equation, or pseudocode; without this definition it is impossible to verify that the function is non-circular and free of selection bias, which undermines the claim that adaptive sampling (rather than the buffer or training length) drives the 100% valid submission rate.
Authors: We agree that the abstract provides only a high-level description of the Learnability Potential function. In the revised manuscript we will add its explicit mathematical definition, including the full equation and pseudocode, to the Methods section and reference that definition from the abstract. This will allow readers to verify that the function is non-circular and free of selection bias, strengthening the argument that adaptive sampling (rather than the buffer or training length alone) is responsible for the reported 100% valid submission rate. revision: yes
-
Referee ([Experimental results]): No details are supplied on the experimental setup, number of runs, statistical tests, baseline implementations, or ablation studies that disable the Learnability Potential function while keeping the Evolving Data Buffer and GRPO fixed; the headline performance numbers therefore cannot be evaluated.
Authors: We acknowledge that the current version lacks sufficient detail on the experimental protocol. In the revision we will expand the Experimental Setup and Results sections to report the number of independent runs, the statistical tests performed, the precise baseline implementations, and dedicated ablation studies that disable only the Learnability Potential function while retaining the Evolving Data Buffer and GRPO. These additions will let readers rigorously evaluate the headline performance numbers. revision: yes
Circularity Check
Empirical method with no load-bearing derivations or self-referential reductions
Full rationale
The paper presents AceGRPO as an engineering combination of an Evolving Data Buffer and Adaptive Sampling via a Learnability Potential function. No equations, derivations, or uniqueness theorems appear in the abstract or described text that reduce performance claims (e.g., the 100% valid submission rate) to fitted parameters, self-citations, or inputs by construction. The claims rest on experimental outcomes on MLE-Bench-Lite rather than a closed mathematical chain. This matches the assessed circularity score of 1.0: the central contribution is empirical and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning can be made efficient enough for long-horizon LLM agent training in MLE despite execution latency.
invented entities (1)
- Learnability Potential function (no independent evidence)
Reference graph
Works this paper leans on
- [1] Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095.
- [2] Deng, Y., Hsu, I., Yan, J., Wang, Z., Han, R., Zhang, G., Chen, Y., Wang, W., Pfister, T., Lee, C.-Y., et al. Supervised reinforcement learning: From expert trajectories to step-wise reasoning. arXiv preprint arXiv:2510.25992.
- [3] Du, S., Yan, X., Jiang, D., Yuan, J., Hu, Y., Li, X., He, L., Zhang, B., and Bai, L. AutoMLGen: Navigating fine-grained optimization for coding agents. arXiv preprint arXiv:2510.08511, 2025a. Du, Y., Cai, Y., Zhou, Y., Wang, C., Qian, Y., Pang, X., Liu, Q., Hu, Y., and Chen, S. SWE-Dev: Evaluating and training autonomous feature-driven software deve...
- [4] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [5] Huang, Q., Vora, J., Liang, P., and Leskovec, J. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302.
- [6] Jiang, Z., Schmidt, D., Srikanth, D., Xu, D., Kaplan, I., Jacenko, D., and Wu, Y. AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138.
- [7] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- [8] Li, A., Wu, C., Ge, Z., Chong, Y. H., Hou, Z., Cao, L., Ju, C., Wu, J., Li, H., Zhang, H., et al. The FM agent. arXiv preprint arXiv:2510.26144.
- [9] Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025a. Liu, Z., Cai, Y., Zhu, X., Zheng, Y., Chen, R., Wen, Y., Wang, Y., Chen, S., et al. ML-Master: Towards AI-for-AI via integration of exploration an...
- [10] Luo, M., Jain, N., Singh, J., Tan, S., Patel, A., Wu, Q., Ariyak, A., Cai, C., Venkat, T., Zhu, S., Athiwaratkun, B., Roongta, M., Zhang, C., Li, L. E., Popa, R. A., Sen, K., and Stoica, I. DeepSWE: Training a state-of-the-art coding agent from scratch by scaling RL. https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the...
- [11] Qiang, R., Zhuang, Y., Li, Y., Zhang, R., Li, C., Wong, I. S.-H., Yang, S., Liang, P., Zhang, C., Dai, B., et al. MLE-Dojo: Interactive environments for empowering LLM agents in machine learning engineering. arXiv preprint arXiv:2505.07782.
- [12] Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., and Shakkottai, S. Reinforcement learning with sparse rewards using guidance from offline demonstration. arXiv preprint arXiv:2202.04628.
- [13] Sahney, A., Gorthi, R., Łastowski, C., and Vega, J. Operand Quant: A single-agent architecture for autonomous machine learning engineering. arXiv preprint arXiv:2510.11694.
- [14] Tang, X., Xu, W., Wang, Y., Guo, Z., Shao, D., Chen, J., Zhang, C., Wang, Z., Zhang, L., Wan, G., et al. Eigen-1: Adaptive multi-agent refinement with monitor-based RAG for scientific reasoning. arXiv preprint arXiv:2509.21193.
- [15]
- [16] Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741.
- [17] Wei, Y., Duchenne, O., Copet, J., Carbonneaux, Q., Zhang, L., Fried, D., Synnaeve, G., Singh, R., and Wang, S. I. SWE-RL: Advancing LLM reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449.
- [18] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Yang, X., Yang, X., Fang, S., Xian, B., Li, Y., Wang, J., Xu, M., Pan, H., Hong, X., Liu, W., et al. R&D-Agent: Automating data-driven AI solution building through LLM-powered automated rese...
- [19] Zhang, Y., Wu, S., Yang, Y., Shu, J., Xiao, J., Kong, C., and Sang, J. o1-Coder: An o1 replication for coding. arXiv preprint arXiv:2412.00154.
- [20] Zhao, A., Wu, Y., Yue, Y., Wu, T., Xu, Q., Lin, M., Wang, S., Wu, Q., Zheng, Z., and Huang, G. Absolute Zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335.
- [21] Zhu, X., Cai, Y., Liu, Z., Zheng, B., Wang, C., Ye, R., Chen, J., Wang, H., Wang, W.-C., Zhang, Y., et al. Toward ultra-long-horizon agentic science: Cognitive accumulation for machine learning engineering. arXiv preprint arXiv:2601.10402.