SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.
Agent-RLVR: Training software engineering agents via guidance and environment rewards
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 2
citation-polarity summary
polarities: background 2
representative citing papers
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
AgenticRL deploys a multimodal GPT agent in a closed-loop process to autonomously design and refine reward functions for PPO-trained vision-conditioned UAV navigation policies, reporting 71% policy improvement and 91% real-world success.
ReSkill is an RL-in-the-loop framework that embeds assertion-driven skill creation, within-group sampling, and Thompson Sampling into GRPO to reconcile skill evolution with policy learning, outperforming prior methods especially on unseen tasks.
ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
SWE-Shepherd trains a lightweight PRM on SWE-Bench trajectories to score intermediate actions and guide code agents, showing gains in efficiency and action quality on SWE-Bench Verified.
RLVR training on five synthetic Atlassian API environments raises average tool-use reward for Qwen models from 0.35-0.92 to 0.95-1.00 on four non-degenerate scenarios.
Gated synthetic augmentations can substitute for additional human-authored RLVR tasks at a cost-adjusted trade rate of 1.4x-11.6x while retaining held-out generalization on ten benchmarks spanning code, instruction following, reasoning, and agentic function calling.
citing papers explorer
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?