Recognition: 2 theorem links
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
Pith reviewed 2026-05-16 14:44 UTC · model grok-4.3
The pith
Reinforcement learning after supervised fine-tuning increases SFT loss, while SFT after RL reduces reward, proving the stages cannot be decoupled.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SFT-then-RL coupling is shown because RL optimization increases the SFT cross-entropy loss, established through KL-based distributional analysis and PL-based landscape analysis. RL-then-SFT coupling follows as SFT reduces the reward achieved by RL under matching conditions. Under the PL condition the optimal RL duration balances reward improvement against SFT degradation, a non-decoupling threshold identifies when RL can still improve SFT, and gradient misalignment is bounded via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted loss increase and reward drop.
What carries the argument
The non-decoupling threshold under the Polyak-Łojasiewicz condition, which marks the training point where further RL begins to raise SFT loss faster than it improves reward.
If this is right
- Optimal RL duration can be calculated from the PL parameters to limit SFT degradation.
- The non-decoupling threshold determines whether interleaving SFT and RL yields net benefit.
- Gradient misalignment between the two objectives admits an explicit bound from spectral concentration.
- Post-training pipelines must track both cross-entropy loss and reward throughout rather than treating stages as independent.
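The first two points above can be made concrete with a toy calculation of an "optimal RL duration" under assumed functional forms. The saturating reward curve, the growing degradation curve, and every constant below are hypothetical stand-ins for the paper's PL-derived quantities, not its actual formulas:

```python
import math

# Illustrative sketch (not the paper's derivation): model reward gain as a
# saturating curve and SFT degradation as growing with RL duration t, then
# pick the duration where net benefit peaks. All functional forms and
# constants here are invented for illustration.

def reward_gain(t, r_max=1.0, a=0.5):
    """Assumed saturating reward improvement from RL."""
    return r_max * (1.0 - math.exp(-a * t))

def sft_degradation(t, c=0.05, b=0.3):
    """Assumed SFT cross-entropy increase accumulating with RL duration."""
    return c * (math.exp(b * t) - 1.0)

def net_benefit(t, lam=1.0):
    """Reward gain minus a lambda-weighted penalty for SFT degradation."""
    return reward_gain(t) - lam * sft_degradation(t)

# Grid search over durations for the point of maximum net benefit.
ts = [i * 0.01 for i in range(2001)]
t_star = max(ts, key=net_benefit)
```

The shape of the trade-off, a concave gain against a convex cost, is what makes an interior optimal duration exist at all; with these invented constants the optimum lands near t ≈ 4.4.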
Where Pith is reading between the lines
- Joint objectives that minimize a weighted sum of SFT loss and negative reward could avoid sequential interference.
- The same coupling pattern may appear when other preference-based and imitation-based methods are alternated.
- Monitoring both loss and reward curves during training could flag coupling effects before full degradation occurs.
- The threshold may scale with model size, offering a way to predict when decoupling becomes feasible at larger scales.
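The monitoring idea above can be sketched as a concrete quantity: the cosine similarity between the SFT and RL gradient vectors. The paper bounds this misalignment via spectral concentration; the snippet below only illustrates the quantity itself, on synthetic gradients chosen to be partially opposed:

```python
import numpy as np

# Hedged sketch of a gradient-misalignment monitor between the SFT and RL
# objectives. Negative cosine similarity means the RL step increases the
# SFT loss to first order. The gradients here are synthetic.

def cosine_alignment(g_sft: np.ndarray, g_rl: np.ndarray) -> float:
    """Cosine similarity in [-1, 1] between two gradient vectors."""
    denom = np.linalg.norm(g_sft) * np.linalg.norm(g_rl)
    if denom == 0.0:
        return 0.0
    return float(np.dot(g_sft, g_rl) / denom)

rng = np.random.default_rng(0)
g_sft = rng.normal(size=1024)
# Partially opposed RL gradient, mimicking the coupling the review describes.
g_rl = -0.5 * g_sft + rng.normal(size=1024)
alignment = cosine_alignment(g_sft, g_rl)
```

In a real pipeline the two vectors would come from backpropagating the cross-entropy loss and the RL objective at the same checkpoint; logging this scalar per step is cheap relative to either backward pass.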
Load-bearing premise
The Polyak-Łojasiewicz condition holds for the loss landscapes that arise in LLM post-training.
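For reference, the premise can be stated precisely. A minimal formal statement of the PL condition, using standard notation for a loss L, its infimum L*, and parameters θ (symbols assumed, since the review fixes none):

```latex
% Polyak-Lojasiewicz (PL) condition for a loss L with infimum L^*:
% there exists \mu > 0 such that, for all parameters \theta,
\|\nabla L(\theta)\|^2 \;\ge\; 2\mu\,\bigl(L(\theta) - L^*\bigr).
% Under gradient flow \dot\theta = -\nabla L(\theta) this yields linear
% convergence: L(\theta_t) - L^* \le e^{-2\mu t}\,\bigl(L(\theta_0) - L^*\bigr).
```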
What would settle it
An experiment in which RL training after SFT produces no increase in SFT cross-entropy loss on a model satisfying the PL condition would directly contradict the coupling result.
Figures
Original abstract
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation, identify the non-decoupling threshold governing when RL can improve SFT, and bound the gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that SFT and RL cannot be decoupled in LLM post-training. It proves that RL after SFT increases SFT loss under both KL-divergence distributional analysis and Polyak-Łojasiewicz (PL) landscape analysis, while SFT after RL reduces the reward achieved by RL. Under the PL condition, it derives the optimal RL duration balancing reward gain against SFT degradation, identifies the non-decoupling threshold, and bounds gradient misalignment via spectral concentration. Experiments on Qwen3-0.6B are reported to confirm the predicted degradation.
Significance. If the central proofs are sound and the PL condition holds as assumed, the result would be significant for post-training pipeline design, providing theoretical justification for interleaving rather than sequencing SFT and RL to avoid performance loss. The dual distributional and landscape arguments plus small-model experiments constitute a strength, though the lack of global verification for the PL inequality limits the immediate applicability of the quantitative predictions.
major comments (3)
- [Landscape analysis section] The PL-based landscape analysis (deriving optimal RL duration, non-decoupling threshold, and gradient misalignment bound) rests on the global inequality ||∇L||² ≥ 2μ(L − L*) for the cross-entropy loss L. No proof of global validity for transformer architectures nor any empirical estimate of μ along the post-training trajectory is supplied, yet this is the precise condition required for the quantitative claims to follow.
- [Experiments section] The experimental confirmation on Qwen3-0.6B is stated to verify the predicted SFT degradation, but the manuscript provides neither the exact measured increase in SFT loss, the data exclusion rules, nor error analysis, making it impossible to assess whether the observed effects quantitatively match the theoretical predictions.
- [KL-based analysis] The KL-based distributional argument for SFT-then-RL coupling requires explicit statement of the conditions on the reward model and policy update that guarantee RL strictly increases SFT loss; without these, it is unclear whether the result holds beyond the specific preference data distributions considered.
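The first major comment asks for an empirical estimate of μ along the post-training trajectory. One minimal way to probe it is to compute the local ratio ||∇L(θ)||² / (2(L(θ) − L*)) at sampled checkpoints. The sketch below does this on a toy quadratic where the true PL constant is known; everything here is illustrative, not the paper's procedure, and a real study would substitute the model's cross-entropy loss and its checkpoints:

```python
import numpy as np

# Probe a local PL constant: mu_hat(theta) = ||grad L||^2 / (2 (L - L*)).
# On the quadratic L(x) = 0.5 * x^T A x with A diagonal, every local
# estimate lies between the smallest and largest eigenvalue of A, and the
# smallest eigenvalue is the true global PL constant.

rng = np.random.default_rng(1)
eigs = np.array([0.2, 1.0, 3.0])   # spectrum of A; min eigenvalue = true mu
A = np.diag(eigs)
L_star = 0.0                        # infimum of the quadratic

def loss(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def local_pl_constant(x):
    return float(grad(x) @ grad(x) / (2.0 * (loss(x) - L_star)))

# Sample points standing in for a training trajectory. Only the infimum of
# the local estimates over the whole trajectory lower-bounds mu.
xs = rng.normal(size=(100, 3))
mu_hats = np.array([local_pl_constant(x) for x in xs])
mu_lower = float(mu_hats.min())
```

The caveat in the final comment is the substantive one: pointwise estimates can be large even where the global constant is small, so only the trajectory-wide minimum is meaningful as a bound.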
minor comments (2)
- [Notation and definitions] Notation for losses (cross-entropy) and rewards should be unified across the distributional and landscape sections to avoid reader confusion.
- [Abstract] The abstract claims 'proofs' but the main text should clarify whether full derivations are provided or if key steps rely on the stated PL assumption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to clarify assumptions, strengthen experimental reporting, and make conditions explicit while preserving the core theoretical contributions under the stated conditions.
Point-by-point responses
Referee: [Landscape analysis section] The PL-based landscape analysis (deriving optimal RL duration, non-decoupling threshold, and gradient misalignment bound) rests on the global inequality ||∇L||² ≥ 2μ(L − L*) for the cross-entropy loss L. No proof of global validity for transformer architectures nor any empirical estimate of μ along the post-training trajectory is supplied, yet this is the precise condition required for the quantitative claims to follow.
Authors: We agree that no global proof of the PL inequality is supplied for transformer cross-entropy loss, which remains an open question in optimization. All quantitative results (optimal RL duration, non-decoupling threshold, gradient bound) are derived under the explicit assumption that the PL condition holds with μ > 0 along the post-training path. In revision we will insert a new paragraph in the landscape section stating this assumption, discussing its scope, and noting that empirical estimation of μ is left for future work. The qualitative non-decoupling result does not require the global PL inequality. revision: partial
Referee: [Experiments section] The experimental confirmation on Qwen3-0.6B is stated to verify the predicted SFT degradation, but the manuscript provides neither the exact measured increase in SFT loss, the data exclusion rules, nor error analysis, making it impossible to assess whether the observed effects quantitatively match the theoretical predictions.
Authors: We accept this criticism. The revised manuscript will report the exact measured SFT loss increases (with numerical deltas), detail the data exclusion/filtering rules applied to the evaluation set, and include error analysis via standard deviations across at least three independent runs with different seeds. These additions will permit direct quantitative comparison to the theoretical predictions. revision: yes
Referee: [KL-based analysis] The KL-based distributional argument for SFT-then-RL coupling requires explicit statement of the conditions on the reward model and policy update that guarantee RL strictly increases SFT loss; without these, it is unclear whether the result holds beyond the specific preference data distributions considered.
Authors: We will revise the KL section to state the assumptions explicitly: the reward model remains fixed, the RL update follows the standard KL-regularized policy gradient objective with positive coefficient β, and the preference distribution satisfies bounded support and Lipschitz conditions on the reward. Under these, the proof shows strict increase in SFT loss. A supporting lemma will be added to formalize the conditions and the resulting inequality. revision: yes
- Open item acknowledged in the exchange: a global proof of the PL inequality ||∇L||² ≥ 2μ(L − L*) for the cross-entropy loss on transformer architectures
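The KL-regularized objective named in the third response, J(π) = E_π[r] − β·KL(π‖π_ref) with β > 0, can be made concrete on a toy categorical policy. The reward vector, reference policy, and β below are invented for illustration; only the closed-form maximizer π*(y) ∝ π_ref(y)·exp(r(y)/β) is standard:

```python
import numpy as np

# Toy instance of the KL-regularized RL objective:
#   J(pi) = E_{y ~ pi}[r(y)] - beta * KL(pi || pi_ref),  beta > 0.
# Policies are categorical distributions over four actions; the reward
# vector and reference policy are invented for illustration.

def kl(p, q):
    """KL divergence between two strictly positive categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

def objective(pi, pi_ref, r, beta):
    return float(pi @ r) - beta * kl(pi, pi_ref)

pi_ref = np.array([0.25, 0.25, 0.25, 0.25])  # reference (e.g. post-SFT) policy
r = np.array([1.0, 0.2, 0.0, -0.5])          # invented per-action rewards
beta = 0.5

# Closed-form maximizer of the KL-regularized objective.
pi_star = pi_ref * np.exp(r / beta)
pi_star /= pi_star.sum()

# Moving from pi_ref to pi_star raises J, but also moves the policy away
# from pi_ref; this distributional shift is the mechanism behind the
# SFT-then-RL coupling claim.
gain = objective(pi_star, pi_ref, r, beta) - objective(pi_ref, pi_ref, r, beta)
shift = kl(pi_star, pi_ref)
```

Whenever the reward is non-constant, the optimal policy differs from the reference, so the shift term is strictly positive, which is the distributional fact the KL-based coupling argument builds on.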
Circularity Check
No circularity: derivations are conditional on explicit PL assumption and standard KL arguments without self-referential reduction
full rationale
The paper derives non-decoupling results via two independent routes: a KL-based distributional argument and a PL-based landscape argument. The PL inequality is introduced as an assumption ('Under the PL condition, we further derive...') rather than derived from the paper's own outputs or fitted values. No equations reduce the claimed optimal RL duration, non-decoupling threshold, or gradient bounds back to the target results by construction. No self-citations are load-bearing for the central proofs, and no parameters are fitted to data then relabeled as predictions. The analyses remain self-contained once the stated assumption is granted.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: the Polyak-Łojasiewicz (PL) condition holds on the SFT and RL loss landscapes
- standard math: KL divergence governs distributional shift between SFT and RL objectives
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean : reality_from_one_distinction (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT degradation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- [3] J. Chen, T. Yu, H. Bai, L. Yao, J. Wu, K. Li, F. Mi, C. Tao, L. Zhu, M. Zhang, X. Li, L. Hou, L. Shang, and Q. Liu. The synergy dilemma of long-CoT SFT and RL: Investigating post-training techniques for reasoning VLMs. arXiv preprint arXiv:2507.07562, 2025a. L. Chen, X. Han, L. Shen, J. Bai, and K.-F. Wong. Beyond two-stage training: Cooperative SFT and R...
- [4] T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
- [5] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [6]
- [7]
- [8] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
- [9]
- [10]
- [11] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- [12] N. Razin, Z. Wang, H. Strauss, S. Wei, J. D. Lee, and S. Arora. What makes a reward model a good teacher? An optimization perspective. arXiv preprint arXiv:2503.15477, 2025.
- [13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [14] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [15] B. Wang, C. Lee, N. Lee, S.-C. Lin, W. Dai, Y. Chen, Y. Chen, Z. Yang, Z. Liu, M. Shoeybi, et al. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models. arXiv preprint arXiv:2512.13607.
- [16] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [17] Y. Yue, Y. Yuan, Q. Yu, X. Zuo, R. Zhu, W. Xu, J. Chen, C. Wang, T. Fan, Z. Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
- [18] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.