SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

2026 · cs.LG · arXiv 2604.23747

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
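The two failure modes named in the abstract can be illustrated with a toy numeric sketch. This is not the DeepSpeed or OpenRLHF code, and the abstract does not say exactly how either bug manifests. The function names, the choice to keep only the final micro-batch, and the choice of "mean of per-mini-batch means" as the incorrect weighting are all assumptions made for illustration.

```python
# Toy illustration only. Nothing here is taken from DeepSpeed or OpenRLHF;
# all names and the exact form of each "bug" are hypothetical.

def accumulate(micro_grads, keep_intermediate=True):
    """Average scalar gradients over the micro-batches of one optimizer step.

    With keep_intermediate=False only the final micro-batch contributes,
    which is one assumed way that dropping intermediate micro-batches
    could distort the accumulated gradient.
    """
    used = micro_grads if keep_intermediate else micro_grads[-1:]
    return sum(used) / len(micro_grads)


def token_weighted_loss(batches):
    """Mean loss per token, pooled across all mini-batches."""
    total = sum(sum(losses) for losses in batches)
    count = sum(len(losses) for losses in batches)
    return total / count


def mean_of_means_loss(batches):
    """Unweighted mean of per-mini-batch mean losses.

    Mini-batches with few tokens get the same weight as large ones,
    an assumed example of incorrect per-mini-batch weighting.
    """
    per_batch = [sum(losses) / len(losses) for losses in batches]
    return sum(per_batch) / len(per_batch)


# Four micro-batch gradients that should be averaged into one step.
micro_grads = [1.0, 2.0, 3.0, 6.0]
full = accumulate(micro_grads)                              # (1+2+3+6)/4 = 3.0
dropped = accumulate(micro_grads, keep_intermediate=False)  # 6/4 = 1.5

# Two mini-batches of per-token losses with unequal token counts.
batches = [[2.0, 2.0, 2.0, 2.0], [8.0]]
pooled = token_weighted_loss(batches)   # 16/5 = 3.2
skewed = mean_of_means_loss(batches)    # (2 + 8)/2 = 5.0
```

In this toy setting the dropped-micro-batch gradient is half the correct one, and the unweighted aggregate overstates the loss because the single-token mini-batch counts as much as the four-token one. The real magnitudes depend on the training configuration and are reported only in the paper itself.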

representative citing papers

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.

citing papers explorer

Showing 1 of 1 citing paper.

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

cs.CL · 2026-05-25 · unverdicted · none · ref 26 · internal anchor
