RL Fine-Tuning Heals OOD Forgetting in SFT

Doina Precup; Guillaume Rabusseau; Hangzhan Jin; Mohammad Hamdaqa; Reihaneh Rabbany; Sicheng Lyu; Sitao Luan; Tianwei Ni

arXiv: 2509.12235 · v3 · submitted 2025-09-08 · cs.LG · cs.AI

Hangzhan Jin, Sitao Luan, Tianwei Ni, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, Mohammad Hamdaqa

Pith reviewed 2026-05-18 17:31 UTC · model grok-4.3


keywords: supervised fine-tuning · reinforcement learning · out-of-distribution reasoning · forgetting · singular vectors · large language models · post-training dynamics


The pith

Reinforcement learning restores out-of-distribution reasoning lost during extended supervised fine-tuning of language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Supervised fine-tuning followed by reinforcement learning is the standard way to improve reasoning in large language models, yet the reasons for its success are not well understood. Checkpoint analyses reveal that out-of-distribution reasoning often improves early in supervised fine-tuning but then worsens even as in-distribution performance continues to rise. Reinforcement learning restores the lost out-of-distribution capability rather than creating new gains beyond the early peak, and it succeeds only for a limited range of supervised fine-tuning checkpoints. Spectral analysis ties this forgetting and recovery to rotations of singular vectors in the model weights, with singular values staying relatively constant. This points to a view of post-training where supervised fine-tuning can induce forgetting that reinforcement learning then heals.

Core claim

Through checkpoint-wise analyses of in-distribution and out-of-distribution reasoning, OOD performance peaks early during SFT and then declines despite continued improvement in ID reasoning. RL typically does not surpass this early SFT peak; rather, it restores OOD capability lost during later SFT, and only from a bounded range of SFT checkpoints. Further spectral analysis shows that this forgetting-and-recovery pattern correlates with rotations of singular vectors, while singular values remain largely stable.

What carries the argument

Checkpoint-wise performance tracking combined with spectral analysis of singular vector rotations during SFT and RL stages.

If this is right

SFT improves ID reasoning but can degrade OOD performance after an early peak.
RL fine-tuning recovers lost OOD capability only when applied to a bounded range of SFT checkpoints.
The recovery does not exceed the highest OOD level reached in early SFT.
Singular vector rotations track the forgetting and recovery cycle while singular values stay stable.
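
The distinction between rotating singular vectors and changing singular values can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the paper's analysis code: `spectral_shift`, the toy matrices, and the choice of `k` are hypothetical. It compares two weight matrices by the relative change of their singular values and by the principal angles between their top-k left singular subspaces.

```python
import numpy as np

def spectral_shift(w_before, w_after, k):
    """Return (relative singular-value change, principal angles of top-k left subspaces)."""
    u0, s0, _ = np.linalg.svd(w_before, full_matrices=False)
    u1, s1, _ = np.linalg.svd(w_after, full_matrices=False)
    # How far the spectrum moved, relative to its original size.
    rel_sigma_change = np.linalg.norm(s1 - s0) / np.linalg.norm(s0)
    # Singular values of U0_k^T U1_k are the cosines of the principal angles.
    cosines = np.linalg.svd(u0[:, :k].T @ u1[:, :k], compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return rel_sigma_change, angles

# Toy stand-in for a weight update (hypothetical data): rotate the output
# space with a random orthogonal matrix, which leaves singular values unchanged.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
w_rotated = q @ w

rel_change, angles = spectral_shift(w, w_rotated, k=8)
print(f"relative sigma change: {rel_change:.2e}")
print(f"largest principal angle (rad): {angles.max():.3f}")
```

In this toy case the spectrum is essentially unchanged while the leading subspace moves substantially, which is the qualitative signature the summary describes.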

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Monitoring changes in singular vector directions during training could flag when OOD degradation begins.
Training methods that limit singular vector rotation might reduce reliance on a separate RL recovery stage.
The same checkpoint and spectral analysis could be applied to other post-training methods such as preference optimization to test for similar patterns.
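
One way to picture a quantity that "limits singular vector rotation" is a subspace-overlap penalty. The sketch below is an assumption made for illustration: `rotation_penalty` and `top_k_left_vectors` are hypothetical names, and this is not claimed to be the authors' rotation-aware fine-tuning or any published method. It computes k minus the squared Frobenius norm of the overlap between reference and current top-k left singular vectors, which equals the sum of squared sines of the principal angles.

```python
import numpy as np

def top_k_left_vectors(w, k):
    """Top-k left singular vectors of w as columns."""
    return np.linalg.svd(w, full_matrices=False)[0][:, :k]

def rotation_penalty(w_ref, w_cur, k):
    """k - ||U_ref^T U_cur||_F^2: 0 if the top-k subspaces coincide, k if orthogonal."""
    u_ref = top_k_left_vectors(w_ref, k)
    u_cur = top_k_left_vectors(w_cur, k)
    overlap = np.linalg.norm(u_ref.T @ u_cur, "fro") ** 2
    return float(k - overlap)

# Hypothetical toy matrices, not model weights.
rng = np.random.default_rng(1)
w_ref = rng.standard_normal((48, 24))
q, _ = np.linalg.qr(rng.standard_normal((48, 48)))

print(rotation_penalty(w_ref, w_ref, k=4))      # unchanged weights
print(rotation_penalty(w_ref, q @ w_ref, k=4))  # randomly rotated weights
```

This NumPy version only evaluates the quantity. Using it as a training penalty would need a differentiable implementation in an autodiff framework, which is outside this sketch.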

Load-bearing premise

The observed OOD forgetting during continued SFT and the subsequent recovery by RL reflect a general property of post-training dynamics rather than an artifact of the specific models, datasets, or evaluation splits chosen.

What would settle it

Re-running the checkpoint-wise ID and OOD evaluations plus spectral analysis on a new model family and reasoning dataset, then finding no early OOD peak during SFT or no restoration by RL from any checkpoint range.
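
The test described above reduces to a few checks on checkpoint curves. The following sketch assumes a hypothetical data layout (a list of `(step, ood_accuracy)` pairs for SFT and a dict of post-RL OOD accuracy keyed by the SFT step RL started from). The numbers are invented placeholders, not results from the paper.

```python
def ood_peak_and_recovery(sft_ood, rl_ood_by_ckpt, tol=0.0):
    """Summarize forgetting and recovery from checkpoint-wise OOD accuracy."""
    peak_step, peak_acc = max(sft_ood, key=lambda p: p[1])
    final_step, final_acc = sft_ood[-1]
    sft_at = dict(sft_ood)
    return {
        "peak_step": peak_step,
        # Forgetting: the peak is not the last checkpoint and accuracy fell after it.
        "forgot_after_peak": peak_step != final_step and final_acc < peak_acc - tol,
        # Checkpoints where RL improved on the SFT checkpoint it started from.
        "recovered_from": sorted(
            s for s, acc in rl_ood_by_ckpt.items() if acc > sft_at[s] + tol
        ),
        # Checkpoints where RL went beyond the best OOD accuracy seen during SFT.
        "exceeds_sft_peak": sorted(
            s for s, acc in rl_ood_by_ckpt.items() if acc > peak_acc + tol
        ),
    }

# Placeholder curves for illustration only.
sft_curve = [(100, 0.10), (200, 0.18), (400, 0.15), (800, 0.11)]
rl_after = {200: 0.18, 400: 0.17, 800: 0.11}
summary = ood_peak_and_recovery(sft_curve, rl_after)
print(summary)
```

An empty `recovered_from` list, or a peak located at the final checkpoint, on a new model family would be the kind of outcome that counts against the claim.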

Figures

Figures reproduced from arXiv: 2509.12235 by Doina Precup, Guillaume Rabusseau, Hangzhan Jin, Mohammad Hamdaqa, Reihaneh Rabbany, Sicheng Lyu, Sitao Luan, Tianwei Ni.

**Figure 2.** (2a) Training and test loss, and (2b) format error curves during SFT. Evolution of (2c) OOD …

**Figure 3.** Advantage estimation distribution and demonstration of questions with non-unique solutions.

**Figure 4.** Singular value restoration for SFT stage. Recoverable panel labels: (a) Layer-wise (ID), (b) Layer-wise (OOD), (c) Top-k (ID); the x-axis is layer or top-k, the y-axis is accuracy, with curves for LLaMA and Qwen. …

**Figure 5.** Singular value restoration for RL stage. … performance in LLaMA; however, Qwen stays relatively robust. This suggests that, in the SFT stage, the task-specific knowledge does not depend too much on the last several layers, and OOD capabilities are highly impacted by the top and bottom blocks of the models. Top-k analysis: as shown in Figure 6c and 6d, restoring the top 2560 singular vectors of LLaMA and top …

**Figure 6.** Singular vector restoration for SFT stage. In the RL stage, we observe that: Layer-wise analysis: as shown in Figure 7a and 7b, the restoration of singular vectors consistently degrades ID and OOD performance for LLaMA, with some perturbations in the intermediate layers (15–25) for OOD performance. ID and OOD performance of Qwen is relatively robust, and also has some perturbations in i…

**Figure 7.** Singular vector restoration for RL stage. Section 5, Related Work on RL Reasoning; 5.1, RL Improves Reasoning and OOD Generalization: Following the introduction of DeepSeek-R1 (DeepSeek-AI, 2025), large-scale RL has emerged as a principal driver of improved reasoning, directly eliciting long chain-of-thought behavior and strong math/coding performance. Notably, the zero-SFT variant (R1-Zero) is trained solely with RL …

**Figure 8.** In-distribution training/test loss and OOD loss curves for LLaMA-3.2-11B and Qwen-2.5-…

**Figure 9.** ID performance on three tasks. Section C.3, Evolution of Rotation-Aware Fine-Tuning: Inspired by our previous experiments in Section 4, we find that the top singular vectors account for around 70% of ID and OOD performance, and the recovery of singular vectors nearly rolls back the performance of both models to the previous stage. We then penalize the singular vectors in the top rank (e.g., 128, 256, 512, 1024) …

**Figure 10.** OOD performance after penalty of top 512 rank in singular vectors compared to the …

**Figure 11.** ID and OOD accuracy for Qwen-2.5-7B on GeneralPoints.

**Figure 12.** Loss of single-stage RL fine-tuning.

**Figure 13.** An example of reward hacking. The RL-only curve sees an increasing reward signal (right panel) but stagnant or low success rates (left panel). Section C.7, Changes of Singular Values: To investigate how SFT and RL reshape the spectral structure of the parameter matrices, we analyze the singular values of $W_q$, $W_k$, $W_v$ and their differences ($\Delta\sigma_i = \sigma_i^{\mathrm{SFT1100}} - \sigma_i^{\mathrm{SFT140}}$ for LLaMA and $\sigma_i^{\mathrm{SFT1100}} - \sigma_i^{\mathrm{SFT140}}$ for Q…

**Figure 14.** Singular value changes in the q_proj, k_proj, and v_proj matrices of the first self-attention layer (layers[5].self_attn) in LLaMA-3.2-11B. Panels (a)–(c) illustrate the impact of supervised fine-tuning (SFT) on $W_q$, $W_k$, and $W_v$, respectively, while panels (d)–(f) depict the corresponding changes following reinforcement learning (RL). Each panel shows the difference in singular values before and after the r…

**Figure 15.** An example of rotation between SFT and RL.

**Figure 16.** PCA visualization of the hidden representations at checkpoints …
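
The captions for Figures 4 to 7 refer to singular value restoration and singular vector restoration without the excerpts spelling out the procedure. A plausible reading, stated here as an assumption and not as the authors' implementation, is that one factor of the SVD of a later checkpoint is swapped for the corresponding factor of an earlier checkpoint. The function names and toy matrices below are hypothetical.

```python
import numpy as np

def restore_singular_values(w_old, w_new):
    """Keep w_new's singular vectors but put back w_old's singular values."""
    u1, _, vt1 = np.linalg.svd(w_new, full_matrices=False)
    s0 = np.linalg.svd(w_old, compute_uv=False)
    return (u1 * s0) @ vt1  # u1 @ diag(s0) @ vt1

def restore_top_k_vectors(w_old, w_new, k):
    """Keep w_new's singular values but put back w_old's top-k singular vectors."""
    u0, _, vt0 = np.linalg.svd(w_old, full_matrices=False)
    u1, s1, vt1 = np.linalg.svd(w_new, full_matrices=False)
    # Mixed bases are generally not orthonormal, so for k below full rank
    # the result is an approximation rather than an exact SVD.
    u = np.concatenate([u0[:, :k], u1[:, k:]], axis=1)
    vt = np.concatenate([vt0[:k], vt1[k:]], axis=0)
    return (u * s1) @ vt

# Hypothetical toy matrices standing in for two checkpoints of one layer.
rng = np.random.default_rng(0)
w_old = rng.standard_normal((16, 8))
w_new = w_old + 0.1 * rng.standard_normal((16, 8))

w_values_restored = restore_singular_values(w_old, w_new)
w_vectors_restored = restore_top_k_vectors(w_old, w_new, k=4)
print(w_values_restored.shape, w_vectors_restored.shape)
```

Evaluating the model after writing such a matrix back into one layer, or after restoring only the top-k directions, would correspond to the layer-wise and top-k panels named in the captions.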

Original abstract

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) is a standard post-training recipe for improving Large Language Model (LLM) reasoning, but why it works remains unclear. We revisit the common claim that "SFT memorizes, RL generalizes" through checkpoint-wise analyses of in-distribution (ID) and out-of-distribution (OOD) reasoning. We find that OOD performance often peaks early during SFT and then declines despite continued improvement in ID reasoning. RL typically does not surpass this early SFT peak; rather, it restores OOD capability lost during later SFT, and only from a bounded range of SFT checkpoints. Further spectral analysis shows that this forgetting-and-recovery pattern correlates with rotations of singular vectors, while singular values remain largely stable. These findings suggest a more precise view of post-training dynamics: SFT can forget, RL can recover, and controlling singular-vector rotation may improve OOD robustness. Code is available at https://github.com/jinhangzhan/RL_Heals_SFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that OOD reasoning performance in LLMs often peaks early during SFT and then declines even as ID performance keeps improving. RL fine-tuning typically restores the lost OOD capability from a bounded range of SFT checkpoints without surpassing the early SFT peak. Spectral (SVD) analysis shows this forgetting-and-recovery pattern correlates with rotations of singular vectors while singular values remain largely stable. The authors conclude that SFT can induce forgetting, RL can recover it, and controlling singular-vector rotation may improve OOD robustness. Code is released.

Significance. If these dynamics hold beyond the studied settings, the work refines the 'SFT memorizes, RL generalizes' view by identifying a recoverable forgetting phase and linking it to a concrete spectral signature. This could inform checkpoint selection and regularization strategies for OOD robustness in reasoning models. The released code is a clear strength, allowing direct reproduction of the checkpoint-wise curves and SVD decompositions.

major comments (2)

[§3] (checkpoint-wise ID/OOD curves): The central claim that OOD performance peaks early and then declines rests on these trajectories. Without reported error bars, number of random seeds, or statistical tests for the peak location and decline, it is difficult to judge whether the forgetting pattern is robust or sensitive to run-to-run variation.

[§4] (SVD analysis): The reported correlation between singular-vector rotations and the OOD recovery pattern is presented as empirical. To support the mechanistic interpretation that rotation drives the forgetting, a quantitative metric (e.g., principal-angle change versus OOD drop) and its consistency across the evaluated models would be needed; the current qualitative description leaves open whether the link is load-bearing or coincidental.

minor comments (1)

[Figures 2-4] Figure captions and axis labels for the checkpoint curves could explicitly state the number of evaluation examples per point and whether results are averaged over multiple prompts or seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of empirical robustness and mechanistic interpretation that we will address in the revision.


Referee: [§3] (checkpoint-wise ID/OOD curves): The central claim that OOD performance peaks early and then declines rests on these trajectories. Without reported error bars, number of random seeds, or statistical tests for the peak location and decline, it is difficult to judge whether the forgetting pattern is robust or sensitive to run-to-run variation.

Authors: We acknowledge the validity of this concern. The presented curves were generated from single runs per configuration, which limits assessment of variability. In the revised manuscript we will rerun the key SFT trajectories with at least three random seeds, add error bars to the ID/OOD plots in §3, and report whether the early peak and subsequent decline remain consistent across seeds. A brief note on the stability of the peak location will be included.

revision: yes

Referee: [§4] (SVD analysis): The reported correlation between singular-vector rotations and the OOD recovery pattern is presented as empirical. To support the mechanistic interpretation that rotation drives the forgetting, a quantitative metric (e.g., principal-angle change versus OOD drop) and its consistency across the evaluated models would be needed; the current qualitative description leaves open whether the link is load-bearing or coincidental.

Authors: We agree that a quantitative link would strengthen the mechanistic claim. In the revision we will introduce a principal-angle metric between the leading singular vectors of early-SFT and later checkpoints, compute its correlation with the observed OOD drop, and report these values (including Pearson coefficients) for every model examined in §4. This will replace the purely qualitative description and show whether the link is consistent across models.

revision: yes
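
The correlation promised in the second response could be computed along the following lines. This is a sketch under assumptions: the per-checkpoint values are invented placeholders (not measurements from the paper), and `pearson` is a hypothetical helper built on `numpy.corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Placeholder values for illustration only: mean principal angle (radians)
# of each later checkpoint relative to the early-SFT checkpoint, and the
# drop in OOD accuracy relative to the early peak.
mean_principal_angle = [0.05, 0.12, 0.20, 0.31, 0.40]
ood_drop = [0.00, 0.01, 0.03, 0.05, 0.07]

r = pearson(mean_principal_angle, ood_drop)
print(f"Pearson r = {r:.3f}")
```

With only a handful of checkpoints per model, a coefficient like this is weak evidence on its own, which is why the referee also asks for consistency across models.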

Circularity Check

0 steps flagged

No circularity: empirical observations from checkpoint analyses and spectral correlations

full rationale

The paper presents direct experimental results from SFT checkpoint evaluations on ID/OOD reasoning tasks and subsequent spectral analysis of singular vectors/values. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. Claims such as OOD performance peaking early then declining, RL restoring lost capability from bounded checkpoints, and correlation with singular-vector rotations are reported as observed patterns rather than quantities derived by construction from inputs or prior self-citations. The work is self-contained against external benchmarks via code release and does not invoke uniqueness theorems or ansatzes that reduce to the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard machine-learning assumptions about the validity of ID/OOD splits for reasoning tasks and the interpretability of singular vectors in weight matrices as carriers of task-relevant information.

axioms (1)

domain assumption: Singular value decomposition of weight matrices reveals directions whose rotation correlates with changes in out-of-distribution reasoning performance. Invoked in the spectral analysis section to link internal model changes to the observed forgetting-and-recovery behavior.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks
cs.CL · 2026-05 · unverdicted · novelty 7.0

ControBench is a new interaction-aware benchmark combining heterogeneous graphs and rich text for controversial discourse analysis on social networks.

Rotation-Preserving Supervised Fine-Tuning
cs.LG · 2026-05 · unverdicted · novelty 6.0

RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

Emergent Slow Thinking in LLMs as Inverse Tree Freezing
cs.AI · 2025-09 · unverdicted · novelty 6.0

RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT c...

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization
cs.LG · 2026-05 · unverdicted · novelty 4.0

Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 4 Pith papers · 2 internal anchors

[1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URL https://arxiv.org/abs/2501.12948. Jonathan B Freeman and Rick Dale. Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1):83–97, 2013. Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Srft: A single-stage method with supe...

[2] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

URL https://arxiv.org/abs/2504.13837. David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Savarese, Gal Vardi, Karen Livescu, Michael Maire, and Matthew R Walter. Approaching deep learning through the spectral dynamics of weights. arXiv preprint arXiv:2408.11804, 2024. Simon Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Peter Tong, Yifei Zhou, Alane Suhr, Saini...

[3] First, turn right to face north

[5] Turn left to face west

[9] Turn right to face east

[10] Move forward until you reach next intersection where Levi & Korsinsky, LLP is on your right behind

[11] Turn left to face north

[13] Turn slightly right to face northeast

[14] Move forward until you reach next intersection

[16] Move forward until you reach next intersection where Mr Goods Buy & Sell is on your left front

[17] Turn left to face northeast

[18] Move forward until you reach next intersection where Skullfade Barbers is on your left front

[19] Turn right to face northwest

[20] current observation

Move forward until you reach destination where The destination Ann Cleaners is on your left. [Action space] forward(): indicates moving forward one step turn direction(x): indicates adjust the ego agent direction towards x direction. x could be any following 8 directions ['north', 'northeast', 'east', 'southeast', 'south', 'southwest', 'west', 'northwest'...

[21] Number check: if numbers in formula are invalid (not from set, wrong count, etc.) → INCORRECT_NUMBER

[22] final answer

Solution check: if no valid "final answer" after format checking → NO_SOLUTION

[23] forgetting

Aggregation: if ≥2 of the above are true for a step → also count AGGREGATED_ERR. How we compute metrics: - Loss: mean token-level CE on train data (OOD). - CORRECT_SOLUTION(CS): a problem is correct if any of its 6 attempts ends with a correct final answer; accuracy is the fraction over 24. - Step-level rates (NO_SOLUTION(NS), ILLEGAL_FORMAT(IF), INCORRECT_NUMB...
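
The CORRECT_SOLUTION rule quoted in entry [23] (a problem counts as correct if any of its 6 attempts ends with a correct final answer, with accuracy taken as a fraction over 24 problems) can be sketched as follows. The function name and the toy outcomes are hypothetical; only the any-of-6 rule and the 24-problem denominator come from the excerpt.

```python
def correct_solution_accuracy(attempts_per_problem):
    """Fraction of problems with at least one correct attempt."""
    solved = sum(1 for attempts in attempts_per_problem if any(attempts))
    return solved / len(attempts_per_problem)

# Hypothetical outcomes: 24 problems, 6 attempts each (True = correct final answer).
outcomes = [[False] * 6 for _ in range(24)]
outcomes[0][5] = True    # solved on the last attempt
outcomes[1][0] = True    # solved on the first attempt
outcomes[1][3] = True    # a second correct attempt does not count twice

cs_accuracy = correct_solution_accuracy(outcomes)
print(cs_accuracy)
```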
