GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.