Recognition: 2 theorem links
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
Pith reviewed 2026-05-16 14:44 UTC · model grok-4.3
The pith
Reinforcement learning after supervised fine-tuning increases SFT loss, while SFT after RL reduces reward, proving the stages cannot be decoupled.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SFT-then-RL coupling is shown because RL optimization increases the SFT cross-entropy loss, established through KL-based distributional analysis and PL-based landscape analysis. RL-then-SFT coupling follows as SFT reduces the reward achieved by RL under matching conditions. Under the PL condition the optimal RL duration balances reward improvement against SFT degradation, a non-decoupling threshold identifies when RL can still improve SFT, and gradient misalignment is bounded via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted loss increase and reward drop.
What carries the argument
The non-decoupling threshold under the Polyak-Łojasiewicz condition, which marks the training point where further RL begins to raise SFT loss faster than it improves reward.
If this is right
- Optimal RL duration can be calculated from the PL parameters to limit SFT degradation.
- The non-decoupling threshold determines whether interleaving SFT and RL yields net benefit.
- Gradient misalignment between the two objectives admits an explicit bound from spectral concentration.
- Post-training pipelines must track both cross-entropy loss and reward throughout rather than treating stages as independent.
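The first two points above can be made concrete with a toy calculation of an "optimal RL duration" under assumed functional forms. The saturating reward curve, the growing degradation curve, and every constant below are hypothetical stand-ins for the paper's PL-derived quantities, not its actual formulas:

```python
import math

# Illustrative sketch (not the paper's derivation): model reward gain as a
# saturating curve and SFT degradation as growing with RL duration t, then
# pick the duration where net benefit peaks. All functional forms and
# constants here are invented for illustration.

def reward_gain(t, r_max=1.0, a=0.5):
    """Assumed saturating reward improvement from RL."""
    return r_max * (1.0 - math.exp(-a * t))

def sft_degradation(t, c=0.05, b=0.3):
    """Assumed SFT cross-entropy increase accumulating with RL duration."""
    return c * (math.exp(b * t) - 1.0)

def net_benefit(t, lam=1.0):
    """Reward gain minus a lambda-weighted penalty for SFT degradation."""
    return reward_gain(t) - lam * sft_degradation(t)

# Grid search over durations for the point of maximum net benefit.
ts = [i * 0.01 for i in range(2001)]
t_star = max(ts, key=net_benefit)
```

The shape of the trade-off, a concave gain against a convex cost, is what makes an interior optimal duration exist at all; with these invented constants the optimum lands near t ≈ 4.4.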
Where Pith is reading between the lines
- Joint objectives that minimize a weighted sum of SFT loss and negative reward could avoid sequential interference.
- The same coupling pattern may appear when other preference-based and imitation-based methods are alternated.
- Monitoring both loss and reward curves during training could flag coupling effects before full degradation occurs.
- The threshold may scale with model size, offering a way to predict when decoupling becomes feasible at larger scales.
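The monitoring idea above can be sketched as a concrete quantity: the cosine similarity between the SFT and RL gradient vectors. The paper bounds this misalignment via spectral concentration; the snippet below only illustrates the quantity itself, on synthetic gradients chosen to be partially opposed:

```python
import numpy as np

# Hedged sketch of a gradient-misalignment monitor between the SFT and RL
# objectives. Negative cosine similarity means the RL step increases the
# SFT loss to first order. The gradients here are synthetic.

def cosine_alignment(g_sft: np.ndarray, g_rl: np.ndarray) -> float:
    """Cosine similarity in [-1, 1] between two gradient vectors."""
    denom = np.linalg.norm(g_sft) * np.linalg.norm(g_rl)
    if denom == 0.0:
        return 0.0
    return float(np.dot(g_sft, g_rl) / denom)

rng = np.random.default_rng(0)
g_sft = rng.normal(size=1024)
# Partially opposed RL gradient, mimicking the coupling the review describes.
g_rl = -0.5 * g_sft + rng.normal(size=1024)
alignment = cosine_alignment(g_sft, g_rl)
```

In a real pipeline the two vectors would come from backpropagating the cross-entropy loss and the RL objective at the same checkpoint; logging this scalar per step is cheap relative to either backward pass.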
Load-bearing premise
The Polyak-Łojasiewicz condition holds for the loss landscapes that arise in LLM post-training.
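For reference, the premise can be stated precisely. A minimal formal statement of the PL condition, using standard notation for a loss L, its infimum L*, and parameters θ (symbols assumed, since the review fixes none):

```latex
% Polyak-Lojasiewicz (PL) condition for a loss L with infimum L^*:
% there exists \mu > 0 such that, for all parameters \theta,
\|\nabla L(\theta)\|^2 \;\ge\; 2\mu\,\bigl(L(\theta) - L^*\bigr).
% Under gradient flow \dot\theta = -\nabla L(\theta) this yields linear
% convergence: L(\theta_t) - L^* \le e^{-2\mu t}\,\bigl(L(\theta_0) - L^*\bigr).
```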
What would settle it
An experiment in which RL training after SFT produces no increase in SFT cross-entropy loss on a model satisfying the PL condition would directly contradict the coupling result.
Figures
Original abstract
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SFT and RL cannot be decoupled in LLM post-training. It proves that RL after SFT increases SFT loss under both KL-divergence distributional analysis and Polyak-Łojasiewicz (PL) landscape analysis, while SFT after RL reduces the reward achieved by RL. Under the PL condition, it derives the optimal RL duration balancing reward gain against SFT degradation, identifies the non-decoupling threshold, and bounds gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B are reported to confirm the predicted degradation.
Significance. If the central proofs are sound and the PL condition holds as assumed, the result would be significant for post-training pipeline design, providing theoretical justification for interleaving rather than sequencing SFT and RL to avoid performance loss. The dual distributional and landscape arguments plus small-model experiments constitute a strength, though the lack of global verification for the PL inequality limits the immediate applicability of the quantitative predictions.
major comments (3)
- [Landscape analysis section] The PL-based landscape analysis (deriving optimal RL duration, non-decoupling threshold, and gradient misalignment bound) rests on the global inequality ||∇L||² ≥ 2μ(L − L*) for the cross-entropy loss L. No proof of global validity for transformer architectures nor any empirical estimate of μ along the post-training trajectory is supplied, yet this is the precise condition required for the quantitative claims to follow.
- [Experiments section] The experimental confirmation on Qwen3-0.6B is stated to verify the predicted SFT degradation, but the manuscript provides neither the exact measured increase in SFT loss, the data exclusion rules, nor error analysis, making it impossible to assess whether the observed effects quantitatively match the theoretical predictions.
- [KL-based analysis] The KL-based distributional argument for SFT-then-RL coupling requires explicit statement of the conditions on the reward model and policy update that guarantee RL strictly increases SFT loss; without these, it is unclear whether the result holds beyond the specific preference data distributions considered.
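The first major comment asks for an empirical estimate of μ along the post-training trajectory. One minimal way to probe it is to compute the local ratio ||∇L(θ)||² / (2(L(θ) − L*)) at sampled checkpoints. The sketch below does this on a toy quadratic where the true PL constant is known; everything here is illustrative, not the paper's procedure, and a real study would substitute the model's cross-entropy loss and its checkpoints:

```python
import numpy as np

# Probe a local PL constant: mu_hat(theta) = ||grad L||^2 / (2 (L - L*)).
# On the quadratic L(x) = 0.5 * x^T A x with A diagonal, every local
# estimate lies between the smallest and largest eigenvalue of A, and the
# smallest eigenvalue is the true global PL constant.

rng = np.random.default_rng(1)
eigs = np.array([0.2, 1.0, 3.0])   # spectrum of A; min eigenvalue = true mu
A = np.diag(eigs)
L_star = 0.0                        # infimum of the quadratic

def loss(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def local_pl_constant(x):
    return float(grad(x) @ grad(x) / (2.0 * (loss(x) - L_star)))

# Sample points standing in for a training trajectory. Only the infimum of
# the local estimates over the whole trajectory lower-bounds mu.
xs = rng.normal(size=(100, 3))
mu_hats = np.array([local_pl_constant(x) for x in xs])
mu_lower = float(mu_hats.min())
```

The caveat in the final comment is the substantive one: pointwise estimates can be large even where the global constant is small, so only the trajectory-wide minimum is meaningful as a bound.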
minor comments (2)
- [Notation and definitions] Notation for losses (cross-entropy) and rewards should be unified across the distributional and landscape sections to avoid reader confusion.
- [Abstract] The abstract claims 'proofs' but the main text should clarify whether full derivations are provided or if key steps rely on the stated PL assumption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to clarify assumptions, strengthen experimental reporting, and make conditions explicit while preserving the core theoretical contributions under the stated conditions.
Point-by-point responses
Referee: [Landscape analysis section] The PL-based landscape analysis (deriving optimal RL duration, non-decoupling threshold, and gradient misalignment bound) rests on the global inequality ||∇L||² ≥ 2μ(L − L*) for the cross-entropy loss L. No proof of global validity for transformer architectures nor any empirical estimate of μ along the post-training trajectory is supplied, yet this is the precise condition required for the quantitative claims to follow.
Authors: We agree that no global proof of the PL inequality is supplied for transformer cross-entropy loss, which remains an open question in optimization. All quantitative results (optimal RL duration, non-decoupling threshold, gradient bound) are derived under the explicit assumption that the PL condition holds with μ > 0 along the post-training path. In revision we will insert a new paragraph in the landscape section stating this assumption, discussing its scope, and noting that empirical estimation of μ is left for future work. The qualitative non-decoupling result does not require the global PL inequality. revision: partial
Referee: [Experiments section] The experimental confirmation on Qwen3-0.6B is stated to verify the predicted SFT degradation, but the manuscript provides neither the exact measured increase in SFT loss, the data exclusion rules, nor error analysis, making it impossible to assess whether the observed effects quantitatively match the theoretical predictions.
Authors: We accept this criticism. The revised manuscript will report the exact measured SFT loss increases (with numerical deltas), detail the data exclusion/filtering rules applied to the evaluation set, and include error analysis via standard deviations across at least three independent runs with different seeds. These additions will permit direct quantitative comparison to the theoretical predictions. revision: yes
Referee: [KL-based analysis] The KL-based distributional argument for SFT-then-RL coupling requires explicit statement of the conditions on the reward model and policy update that guarantee RL strictly increases SFT loss; without these, it is unclear whether the result holds beyond the specific preference data distributions considered.
Authors: We will revise the KL section to state the assumptions explicitly: the reward model remains fixed, the RL update follows the standard KL-regularized policy gradient objective with positive coefficient β, and the preference distribution satisfies bounded support and Lipschitz conditions on the reward. Under these, the proof shows strict increase in SFT loss. A supporting lemma will be added to formalize the conditions and the resulting inequality. revision: yes
- Open item acknowledged in the exchange: a global proof of the PL inequality ||∇L||² ≥ 2μ(L − L*) for the cross-entropy loss on transformer architectures
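The KL-regularized objective named in the third response, J(π) = E_π[r] − β·KL(π‖π_ref) with β > 0, can be made concrete on a toy categorical policy. The reward vector, reference policy, and β below are invented for illustration; only the closed-form maximizer π*(y) ∝ π_ref(y)·exp(r(y)/β) is standard:

```python
import numpy as np

# Toy instance of the KL-regularized RL objective:
#   J(pi) = E_{y ~ pi}[r(y)] - beta * KL(pi || pi_ref),  beta > 0.
# Policies are categorical distributions over four actions; the reward
# vector and reference policy are invented for illustration.

def kl(p, q):
    """KL divergence between two strictly positive categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

def objective(pi, pi_ref, r, beta):
    return float(pi @ r) - beta * kl(pi, pi_ref)

pi_ref = np.array([0.25, 0.25, 0.25, 0.25])  # reference (e.g. post-SFT) policy
r = np.array([1.0, 0.2, 0.0, -0.5])          # invented per-action rewards
beta = 0.5

# Closed-form maximizer of the KL-regularized objective.
pi_star = pi_ref * np.exp(r / beta)
pi_star /= pi_star.sum()

# Moving from pi_ref to pi_star raises J, but also moves the policy away
# from pi_ref; this distributional shift is the mechanism behind the
# SFT-then-RL coupling claim.
gain = objective(pi_star, pi_ref, r, beta) - objective(pi_ref, pi_ref, r, beta)
shift = kl(pi_star, pi_ref)
```

Whenever the reward is non-constant, the optimal policy differs from the reference, so the shift term is strictly positive, which is the distributional fact the KL-based coupling argument builds on.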
Circularity Check
No circularity: derivations are conditional on explicit PL assumption and standard KL arguments without self-referential reduction
full rationale
The paper derives non-decoupling results via two independent routes: a KL-based distributional argument and a PL-based landscape argument. The PL inequality is introduced as an assumption ('Under the PL condition, we further derive...') rather than derived from the paper's own outputs or fitted values. No equations reduce the claimed optimal RL duration, non-decoupling threshold, or gradient bounds back to the target results by construction. No self-citations are load-bearing for the central proofs, and no parameters are fitted to data then relabeled as predictions. The analyses remain self-contained once the stated assumption is granted.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: the Polyak-Łojasiewicz (PL) condition holds on the SFT and RL loss landscapes
- standard math: KL divergence governs distributional shift between SFT and RL objectives
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean : reality_from_one_distinction (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- [3] J. Chen, T. Yu, H. Bai, L. Yao, J. Wu, K. Li, F. Mi, C. Tao, L. Zhu, M. Zhang, X. Li, L. Hou, L. Shang, and Q. Liu. The synergy dilemma of long-CoT SFT and RL: Investigating post-training techniques for reasoning VLMs. arXiv preprint arXiv:2507.07562, 2025a. L. Chen, X. Han, L. Shen, J. Bai, and K.-F. Wong. Beyond two-stage training: Cooperative SFT and R...
- [4] T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
- [5] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [6]
- [7]
- [8] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
- [9]
- [10]
- [11] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- [12] N. Razin, Z. Wang, H. Strauss, S. Wei, J. D. Lee, and S. Arora. What makes a reward model a good teacher? An optimization perspective. arXiv preprint arXiv:2503.15477, 2025.
- [13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [14] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [15] B. Wang, C. Lee, N. Lee, S.-C. Lin, W. Dai, Y. Chen, Y. Chen, Z. Yang, Z. Liu, M. Shoeybi, et al. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models. arXiv preprint arXiv:2512.13607.
- [16] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [17] Y. Yue, Y. Yuan, Q. Yu, X. Zuo, R. Zhu, W. Xu, J. Chen, C. Wang, T. Fan, Z. Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
- [18] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.