Agent-RLVR: Training software engineering agents via guidance and environment rewards
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citation-polarity summary: background 2
representative citing papers
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
SWE-Shepherd trains a lightweight PRM on SWE-Bench trajectories to score intermediate actions and guide code agents, showing gains in efficiency and action quality on SWE-Bench Verified.
citing papers explorer
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
  SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
- SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents
  SWE-Shepherd trains a lightweight PRM on SWE-Bench trajectories to score intermediate actions and guide code agents, showing gains in efficiency and action quality on SWE-Bench Verified.
- Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective