On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

2026 · cs.LG · arXiv 2601.07389

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.
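The SFT-then-RL coupling described in the abstract can be illustrated with a toy sketch. This is not the paper's analysis or its Qwen3-0.6B experiment: it assumes a three-action softmax policy, and the expert distribution, reward vector, step counts, and function names are all hypothetical values chosen for illustration. SFT is run as gradient descent on cross-entropy, then RL as exact policy-gradient ascent on expected reward.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(logits, expert):
    # Cross-entropy between the (hypothetical) expert distribution and the policy.
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(expert, probs) if q > 0)

def expected_reward(logits, reward):
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, reward))

def sft_step(logits, expert, lr):
    # Gradient of cross-entropy w.r.t. the logits is (probs - expert).
    probs = softmax(logits)
    return [z - lr * (p - q) for z, p, q in zip(logits, probs, expert)]

def rl_step(logits, reward, lr):
    # Exact policy gradient for a softmax policy: dJ/dz_i = p_i * (r_i - J).
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, reward))
    return [z + lr * p * (r - j) for z, p, r in zip(logits, probs, reward)]

def run_sft_then_rl(sft_steps=200, rl_steps=200, lr=0.5):
    # Toy assumption: the expert prefers action 0, the reward prefers action 1.
    expert = [0.8, 0.1, 0.1]
    reward = [0.0, 1.0, 0.0]
    logits = [0.0, 0.0, 0.0]
    for _ in range(sft_steps):
        logits = sft_step(logits, expert, lr)
    after_sft = (sft_loss(logits, expert), expected_reward(logits, reward))
    for _ in range(rl_steps):
        logits = rl_step(logits, reward, lr)
    after_rl = (sft_loss(logits, expert), expected_reward(logits, reward))
    return after_sft, after_rl

if __name__ == "__main__":
    (loss_sft, rew_sft), (loss_rl, rew_rl) = run_sft_then_rl()
    print(f"after SFT: loss={loss_sft:.3f} reward={rew_sft:.3f}")
    print(f"after RL:  loss={loss_rl:.3f} reward={rew_rl:.3f}")
```

In this toy setting the reward rises during the RL phase while the SFT cross-entropy rises with it, because the reward-maximizing policy moves probability mass away from the expert distribution. The effect depends on the expert and reward disagreeing, which is an assumption built into the example.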

representative citing papers

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

GAC derives adaptive mixing weights for SFT-RL hybrid post-training from online gradient variance and signal disagreement estimates, improving benchmark performance over fixed schedules with under 1% overhead.

citing papers explorer

Showing 1 of 1 citing paper.

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

cs.LG · 2026-05-25 · unverdicted · none · ref 36 · internal anchor
