Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
9 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (2)
Citation polarities: background (2)
Representative citing papers
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
citing papers explorer
- SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents
  SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
  AgenticRL deploys a multimodal GPT agent in a closed-loop process to autonomously design and refine reward functions for PPO-trained vision-conditioned UAV navigation policies, reporting 71% policy improvement and 91% real-world success.
- ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL
  ReSkill is an RL-in-the-loop framework that embeds assertion-driven skill creation, within-group sampling, and Thompson Sampling into GRPO to reconcile skill evolution with policy learning, outperforming prior methods, especially on unseen tasks.
- Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
  ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.
- SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents
  SWE-Shepherd trains a lightweight PRM on SWE-Bench trajectories to score intermediate actions and guide code agents, showing gains in efficiency and action quality on SWE-Bench Verified.
- Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows
  RLVR training on five synthetic Atlassian API environments raises average tool-use reward for Qwen models from 0.35-0.92 to 0.95-1.00 on four non-degenerate scenarios.
- Trading Human Curation for Synthetic Augmentation in RLVR
  Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.