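Supervised fine-tuning (SFT) of LLMs causes out-of-distribution (OOD) reasoning performance to peak early and then decline while in-distribution (ID) performance improves; RL recovers the lost OOD performance when started from specific SFT checkpoints, and the pattern correlates with rotations of the singular vectors of the model weights.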
2 Pith papers cite this work.
years: 2025 (2)
representative citing papers
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
citing papers explorer
- RL Fine-Tuning Heals OOD Forgetting in SFT
SFT on LLMs causes OOD reasoning to peak early then decline while ID improves; RL recovers the lost OOD performance from specific SFT checkpoints, with the pattern correlating to rotations in singular vectors of model weights.
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.