arXiv: 2505.23723 · v2 · submitted 2025-05-29 · 💻 cs.CL · cs.AI · cs.LG

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Pith reviewed 2026-05-19 12:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM agents · reinforcement learning · autonomous machine learning · agentic AI · ML engineering · online RL · Qwen model · cross-task generalization

The pith

A 7B LLM agent trained with online RL on ML tasks matches larger proprietary models at lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts from prompt-only LLM agents to a learning-based paradigm where agents improve through interactive experimentation and reinforcement learning on machine learning tasks. It introduces a three-part framework of exploration-enriched fine-tuning, step-wise RL, and a dedicated reward module to train a 7B Qwen model. Despite exposure to only nine ML tasks, the resulting agent reaches performance levels comparable to agents built on much larger closed models such as GPT-5 while using far less compute and exhibiting cross-task generalization. A sympathetic reader cares because the work points toward more accessible, open, and scalable ways to create autonomous ML systems.

Core claim

Training an LLM agent on ML engineering tasks via online reinforcement learning with exploration-enriched fine-tuning, step-wise action updates, and an agentic ML-specific reward module produces a 7B model that achieves performance comparable to agents using much larger proprietary LLMs despite training on only nine tasks and demonstrates strong cross-task generalization.

What carries the argument

The agentic ML training framework consisting of exploration-enriched fine-tuning, step-wise RL, and an ML-specific reward module that converts execution feedback into consistent scalar rewards.
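
To make the idea of converting execution feedback into a scalar concrete, here is a minimal Python sketch. It is not the paper's reward function: the `Feedback` container, the three-way split (invalid action, failed or uninformative run, measured metric change), the neutral value of 0.5, and the `scale` parameter are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    """Hypothetical record of what happened after one agent action."""
    valid_action: bool                      # action parsed and was allowed
    errored: bool = False                   # runtime error, timeout, OOM, ...
    metric_before: Optional[float] = None   # task metric before the action
    metric_after: Optional[float] = None    # task metric after the action
    higher_is_better: bool = True           # False for losses such as RMSE


def scalar_reward(fb: Feedback, scale: float = 1.0) -> float:
    """Map heterogeneous feedback to a reward in [0, 1] (illustrative only)."""
    if not fb.valid_action:
        return 0.0                          # malformed or disallowed action
    if fb.errored or fb.metric_before is None or fb.metric_after is None:
        return 0.5                          # nothing measurable: stay neutral
    if not (math.isfinite(fb.metric_before) and math.isfinite(fb.metric_after)):
        return 0.5                          # NaN/inf metrics treated as neutral
    delta = fb.metric_after - fb.metric_before
    if not fb.higher_is_better:
        delta = -delta                      # a lower loss counts as improvement
    z = max(-60.0, min(60.0, scale * delta))  # clamp to avoid exp overflow
    return 1.0 / (1.0 + math.exp(-z))       # sigmoid: 0.5 means no change
```

Under these assumptions an unchanged metric and an uninformative run both score 0.5, so only measured improvements push the reward above neutral. Because raw metrics differ widely in magnitude across tasks, `scale` would in practice have to be chosen per task or per metric.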

If this is right

  • Smaller open-weight LLMs become viable for autonomous ML engineering roles.
  • Training on a handful of tasks can produce agents that generalize to unseen ML problems.
  • Computational cost for capable ML agents drops dramatically compared with proprietary large-model approaches.
  • Online RL on execution trajectories enables continuous improvement without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RL framework might transfer to agentic tasks in software engineering or scientific hypothesis generation.
  • Scaling the number of training tasks or model size could further widen the performance gap versus prompt-only baselines.
  • Open release of such agents would lower barriers for researchers to build and iterate on domain-specific ML automation tools.

Load-bearing premise

The agentic ML-specific reward module can turn varied and sometimes noisy ML feedback signals into consistent scalar rewards that support stable step-wise RL without systematic bias.

What would settle it

Evaluating the trained 7B ML-Agent on a new set of ML tasks and finding that its performance falls substantially below that of GPT-5-based agents would falsify the claim of comparable results and cross-task generalization.

Figures

Figures reproduced from arXiv: 2505.23723 by Bo Zhang, Jingyi Chai, Lei Bai, Rui Ye, Shuo Tang, Siheng Chen, Xinyu Zhu, Zexi Liu.

Figure 1
Figure 1: Overview of the agentic ML training framework, introducing (1) exploration-enriched fine-tuning for diverse action pool creation, (2) a step-wise RL paradigm for efficient experience collection using expert trajectories, and (3) an agentic ML-specific reward module for various ML feedback handling and task-specific performance evaluation. (See Appendix B.1). An expert LLM with policy π_e then executes the f…
Figure 3
Figure 3: ML-Agent achieves continuous performance improvements; step-wise RL (evaluated every 5 steps) is more efficient than episode-wise RL (standard PPO, evaluated every 1 step) on both held-in and held-out tasks.
Figure 4
Figure 4: Exploration-enriched fine-tuning is crucial for RL training. "N/A" means that training based on the model fails to generate valid results.
Figure 6
Figure 6: Task file structure and prompt for the machine learning problem, using the cifar-10 task as an example. … script, "train.py", generated by GPT-4o-mini; (3) an evaluation script, "eval.py", which is used to calculate the test score from the submitted results; (4) a problem description file, "research_problem.txt"; and (5) a "prepare.py" script to download the data if necessary. An example file structure and related…
Figure 7
Figure 7: Unique noun counts per verb across 100 randomly sampled edit instructions, comparing the Qwen2.5-7B-Instruct model (blue) with the ML-Agent-SFT model (red).
[Plot residue belonging to the next figure: panels cifar10, house-price, denoisin…; x-axis Best@K over 4, 8, 16, 32, 64, 128; y-axis Score; series 0k, 3k, 10k, 30k.]
Figure 8
Figure 8: Different numbers of training samples in RL, starting from our SFT model. Effects of training data size in RL. Here, we investigate how varying the number of training data samples (0k, 3k, 10k, 30k) affects the performance in RL. The 0k scenario represents the ML-Agent-SFT model. For every model, we run 128 trajectories for each task and evaluate the best@K, where K ranges over [4, 8, 16, 32, 64, 128], as shown…
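
The Figure 8 text mentions best@K over 128 trajectories per task but, as extracted here, does not say how it is computed. One common reading is the expected best score among K trajectories drawn without replacement from the N that were run. The sketch below implements that reading exactly using order statistics; the function name and the `higher_is_better` flag are assumptions, and the paper may use a different estimator (for example, the maximum over the first K runs).

```python
from math import comb
from typing import Sequence


def best_at_k(scores: Sequence[float], k: int, higher_is_better: bool = True) -> float:
    """Expected best score among k of the n runs, sampled without replacement."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    # Flip signs for lower-is-better metrics so "best" is always the maximum.
    sign = 1.0 if higher_is_better else -1.0
    ordered = sorted(sign * s for s in scores)
    # The element at sorted index i is the maximum of exactly C(i, k-1)
    # of the C(n, k) possible size-k subsets.
    weighted = sum(comb(i, k - 1) * x for i, x in enumerate(ordered))
    return sign * weighted / comb(n, k)
```

With `k = 1` this reduces to the mean score, and with `k = len(scores)` to the single best run. Passing `higher_is_better=False` would suit error metrics such as those plotted for the house-price task, where a smaller value is better.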
Original abstract

The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, the dominant prompt-based paradigm exhibits limitations: smaller models lack the capacity to learn from execution trajectories for generalization, while large proprietary models incur high computational overhead, restricting accessibility and scalability. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Despite training on only 9 ML tasks, our 7B-sized ML-Agent achieves comparable performance to agents using much larger proprietary LLMs (e.g., GPT-5) but at significantly lower computational cost, demonstrating strong performance and cross-task generalization.
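
As a rough illustration of what "training on a single action step" could mean in practice, the Python sketch below samples states from a pool built out of expert trajectories, asks the policy for one action per state, and scores that action immediately. Every name here (`build_state_pool`, `collect_stepwise_batch`, the `policy` and `reward_fn` callables) is hypothetical, and the policy update itself (for example, a PPO step on the collected batch) is deliberately left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = str    # e.g. task description plus interaction history (assumed)
Action = str   # e.g. one edit or execute command (assumed)


def build_state_pool(expert_trajectories: Sequence[Sequence[State]]) -> List[State]:
    """Flatten the states visited by expert trajectories into one pool."""
    return [state for traj in expert_trajectories for state in traj]


def collect_stepwise_batch(
    state_pool: Sequence[State],
    policy: Callable[[State], Action],
    reward_fn: Callable[[State, Action], float],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[State, Action, float]]:
    """Collect (state, action, reward) triples, one action per sampled state."""
    batch = []
    for _ in range(batch_size):
        state = rng.choice(state_pool)   # start from a state the expert reached
        action = policy(state)           # roll out a single step, not an episode
        batch.append((state, action, reward_fn(state, action)))
    return batch
```

The point of the design is that each sample costs one action and one evaluation rather than a full multi-step episode, which is consistent with the abstract's claim of faster experience collection.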

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a learning-based agentic ML framework with three components—exploration-enriched fine-tuning, step-wise RL, and an agentic ML-specific reward module—to train ML-Agent, a 7B Qwen-2.5 LLM agent for autonomous machine learning engineering. It claims that despite training on only 9 ML tasks, this agent achieves performance comparable to agents using much larger proprietary LLMs (e.g., GPT-5) at significantly lower computational cost while demonstrating strong cross-task generalization.

Significance. If the empirical results and supporting analyses hold, the work would be significant in showing that reinforcement learning on small open models can match proprietary large-model agents in interactive ML engineering tasks, potentially improving accessibility and reducing compute barriers in autonomous ML systems.

major comments (2)
  1. [Abstract] Abstract: the claim of 'comparable performance' to GPT-5 agents is stated without any quantitative metrics, baselines, statistical details, error bars, or tables, leaving the central empirical result without visible support.
  2. [§3] Framework description: the agentic ML-specific reward module is described only qualitatively as unifying 'varied ML feedback signals' into consistent rewards; no equation, pseudocode, weighting scheme, normalization procedure, or handling of noisy signals (e.g., NaN losses or timeouts) is provided, which is load-bearing for the step-wise RL optimization and generalization claims.
minor comments (1)
  1. [Abstract] Clarify whether 'GPT-5' refers to a specific released model or a placeholder, and ensure all model names in experiments are consistently referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'comparable performance' to GPT-5 agents is stated without any quantitative metrics, baselines, statistical details, error bars, or tables, leaving the central empirical result without visible support.

    Authors: We agree that the abstract would benefit from more explicit quantitative grounding for the central claim. In the revised manuscript, we will update the abstract to include key performance metrics (such as success rates on the evaluated ML tasks), direct comparisons to the GPT-5-based baselines, and explicit references to the experimental tables and figures that contain statistical details and error bars. This change will make the empirical support visible at the abstract level without altering the overall narrative. revision: yes

  2. Referee: [§3] Framework description: the agentic ML-specific reward module is described only qualitatively as unifying 'varied ML feedback signals' into consistent rewards; no equation, pseudocode, weighting scheme, normalization procedure, or handling of noisy signals (e.g., NaN losses or timeouts) is provided, which is load-bearing for the step-wise RL optimization and generalization claims.

    Authors: We acknowledge that the current presentation of the agentic ML-specific reward module remains at a high level. We will revise §3 to include the full mathematical formulation of the reward function, pseudocode for the unification and computation process, the weighting and normalization schemes applied to different ML feedback signals, and explicit procedures for handling noisy or invalid signals such as NaN losses and timeouts. These additions will directly support the claims regarding step-wise RL training efficiency and cross-task generalization. revision: yes

Circularity Check

0 steps flagged

Empirical RL training outcome on ML tasks with no derivation reducing to fitted inputs

Full rationale

The paper reports an empirical result: a 7B Qwen-2.5 model trained via the proposed three-component framework (exploration-enriched fine-tuning, step-wise RL, and agentic ML-specific reward module) on only 9 tasks achieves performance comparable to larger proprietary agents. No mathematical derivation chain, equations, or uniqueness theorem is presented that reduces the headline claim to a quantity defined in terms of its own fitted parameters or self-citations. The central performance claim rests on external evaluation benchmarks rather than internal self-definition, satisfying the self-contained-against-benchmarks criterion for a low circularity score.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the effectiveness of the newly introduced reward unification and exploration mechanisms; specific numerical parameters for RL or reward weighting are not detailed in the abstract.

free parameters (1)
  • reward module scaling factors
    Parameters that combine different ML feedback signals into a single reward are introduced by the framework and must be set during training.
axioms (1)
  • domain assumption: Varied ML task feedback signals can be unified into consistent scalar rewards suitable for RL
    Invoked in the description of the agentic ML-specific reward module as the basis for optimization.
invented entities (1)
  • ML-Agent (no independent evidence)
    purpose: The trained 7B LLM agent capable of autonomous ML engineering
    New agent introduced through the described training process.

pith-pipeline@v0.9.0 · 5792 in / 1363 out tokens · 59529 ms · 2026-05-19T12:45:07.310352+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.

  2. Can We Predict Before Executing Machine Learning Agents?

    cs.CL 2026-01 unverdicted novelty 6.0

    LLMs primed with verified data reports predict agent solution quality at 61.5% accuracy, powering a Predict-then-Verify agent that converges 6x faster than execution-only baselines.

  3. AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

    cs.IR 2026-04 unverdicted novelty 5.0

    AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.

  4. AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

    cs.IR 2026-04 unverdicted novelty 4.0

    AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.
