1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
RL Fine-Tuning Heals OOD Forgetting in SFT
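Supervised fine-tuning (SFT) of LLMs causes out-of-distribution (OOD) reasoning performance to peak early and then decline, even as in-distribution (ID) performance improves. RL fine-tuning recovers the lost OOD performance from specific SFT checkpoints, and the pattern correlates with rotations in the singular vectors of the model weights.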