AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework

Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, et al. · 2025 · arXiv 2510.04206

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

G2PO transforms linear trajectories into graphs, aggregates identical states for lower-variance value estimates, and uses edge-centric TD standardization, reporting up to 22.2% gains over GRPO on WebShop, ALFWorld, and AppWorld.

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

AgenticRL deploys a multimodal GPT agent in a closed-loop process to autonomously design and refine reward functions for PPO-trained vision-conditioned UAV navigation policies, reporting 71% policy improvement and 91% real-world success.

AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

TRACE: Capability-Targeted Agentic Training

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

TRACE identifies capability gaps from agent trajectory contrasts, synthesizes per-capability RL training environments, and routes LoRA adapters at inference to improve performance on customer service and tool-use benchmarks.

SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

cs.RO · 2026-03-26 · unverdicted · novelty 6.0

SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation

cs.CV · 2025-12-15 · unverdicted · novelty 6.0

AgentIAD introduces an agentic VLM with Perceptive Zoomer, Web Searcher, and Comparative Retriever tools plus two-stage SFT-then-RL training, achieving 5.92% higher classification accuracy than prior SOTA on the MMAD benchmark.

Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

cs.LG · 2025-11-01 · unverdicted · novelty 6.0

Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.

AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

AlphaMemo equips LLM alpha-mining agents with AST-diff motif memory, residual learning, and asymmetric veto control to improve out-of-sample factor discovery on CSI 500 and S&P 500.

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

cs.LG · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

GROW decomposes trajectories into state-action samples to enable GRPO for multi-turn VLM agents and reports state-of-the-art results on more than 800 Minecraft tasks.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

cs.AI · 2026-05-07 · unverdicted · novelty 5.0 · 3 refs

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.

Trading Human Curation for Synthetic Augmentation in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.

citing papers explorer

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
cs.LG · 2026-05-15 · unverdicted · none · ref 44

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning
cs.LG · 2026-06-22 · unverdicted · none · ref 60

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
cs.RO · 2026-06-02 · unverdicted · none · ref 20

AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
cs.CV · 2026-04-09 · unverdicted · none · ref 43

TRACE: Capability-Targeted Agentic Training
cs.AI · 2026-04-07 · unverdicted · none · ref 5

SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models
cs.RO · 2026-03-26 · unverdicted · none · ref 31

AgentIAD: Agentic Industrial Anomaly Detection via Adaptive Memory Augmentation
cs.CV · 2025-12-15 · unverdicted · none · ref 37

Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse
cs.LG · 2025-11-01 · unverdicted · none · ref 18

AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents
cs.AI · 2026-05-26 · unverdicted · none · ref 16

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
cs.LG · 2026-05-18 · unverdicted · none · ref 28 · 2 links

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI · 2026-05-07 · unverdicted · none · ref 87 · 3 links

Trading Human Curation for Synthetic Augmentation in RLVR
cs.LG · 2026-06-02 · unverdicted · none · ref 5
