Agent-rlvr: Training software engineering agents via guidance and environment rewards

Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, Sean Hendryx · 2025 · arXiv 2506.11425

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

AgenticRL deploys a multimodal GPT agent in a closed-loop process to autonomously design and refine reward functions for PPO-trained vision-conditioned UAV navigation policies, reporting 71% policy improvement and 91% real-world success.

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

ReSkill is an RL-in-the-loop framework that embeds assertion-driven skill creation, within-group sampling, and Thompson Sampling into GRPO to reconcile skill evolution with policy learning, outperforming prior methods especially on unseen tasks.

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

cs.SE · 2025-09-21 · conditional · novelty 6.0

SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.

SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents

cs.SE · 2026-04-12 · unverdicted · novelty 5.0

SWE-Shepherd trains a lightweight PRM on SWE-Bench trajectories to score intermediate actions and guide code agents, showing gains in efficiency and action quality on SWE-Bench Verified.

Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows

cs.AI · 2026-07-01 · unverdicted · novelty 4.0

RLVR training on five synthetic Atlassian API environments raises average tool-use reward for Qwen models from 0.35-0.92 to 0.95-1.00 on four non-degenerate scenarios.

Trading Human Curation for Synthetic Augmentation in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.

citing papers explorer

Showing 8 of 8 citing papers after filters.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents cs.CL · 2026-06-11 · unverdicted · none · ref 53
SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 16
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation cs.RO · 2026-06-02 · unverdicted · none · ref 21
AgenticRL deploys a multimodal GPT agent in a closed-loop process to autonomously design and refine reward functions for PPO-trained vision-conditioned UAV navigation policies, reporting 71% policy improvement and 91% real-world success.
ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL cs.AI · 2026-06-01 · unverdicted · none · ref 2
ReSkill is an RL-in-the-loop framework that embeds assertion-driven skill creation, within-group sampling, and Thompson Sampling into GRPO to reconcile skill evolution with policy learning, outperforming prior methods especially on unseen tasks.
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective cs.LG · 2026-05-13 · unverdicted · none · ref 16 · 2 links
ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.
SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents cs.SE · 2026-04-12 · unverdicted · none · ref 5
SWE-Shepherd trains a lightweight PRM on SWE-Bench trajectories to score intermediate actions and guide code agents, showing gains in efficiency and action quality on SWE-Bench Verified.
Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows cs.AI · 2026-07-01 · unverdicted · none · ref 2
RLVR training on five synthetic Atlassian API environments raises average tool-use reward for Qwen models from 0.35-0.92 to 0.95-1.00 on four non-degenerate scenarios.
Trading Human Curation for Synthetic Augmentation in RLVR cs.LG · 2026-06-02 · unverdicted · none · ref 1
Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.

Agent-rlvr: Training software engineering agents via guidance and environment rewards

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer