Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates to ~50% in binary-reward GRPO, delivering 2.01x and 1.55x speedups on Qwen3-14B/32B with slight score improvements on SWE-bench Verified.
Reuse your FLOPs: Scaling RL on hard problems by conditioning on very off-policy prefixes.CoRR, abs/2601.18795
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1roles
background 1polarities
background 1representative citing papers
citing papers explorer
-
Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates to ~50% in binary-reward GRPO, delivering 2.01x and 1.55x speedups on Qwen3-14B/32B with slight score improvements on SWE-bench Verified.