Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
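The phrase "KL-minimal solutions among all policies that solve the new task" can be made concrete with a toy discrete example. The sketch below is only an illustration and does not reproduce the paper's experiments: the base policy, the set of correct actions, and the two candidate policies are hypothetical numbers, and measuring the divergence as KL(new || base) is an assumption made here so that the value stays finite for policies that put zero mass on some actions.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical base policy over four actions (illustrative numbers only).
base = [0.4, 0.3, 0.2, 0.1]
# Assumption: actions 1 and 3 are the ones that solve the new task.
correct = {1, 3}

# Candidate A: keep the base policy's relative preferences, restricted to correct actions.
mass = sum(base[i] for i in correct)
renormalized = [base[i] / mass if i in correct else 0.0 for i in range(len(base))]

# Candidate B: collapse onto a single correct action.
one_hot = [0.0, 0.0, 0.0, 1.0]

kl_a = kl(renormalized, base)
kl_b = kl(one_hot, base)
print(f"KL(renormalized || base) = {kl_a:.4f}")
print(f"KL(one-hot || base)      = {kl_b:.4f}")
```

Both candidates solve the toy task with probability 1, but they sit at different distances from the base policy: roughly 0.92 nats for the renormalized policy versus roughly 2.30 nats for the one-hot policy. In this toy setting the renormalized policy is the KL-minimal solution, and its divergence equals the negative log of the probability mass the base policy already placed on correct actions.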
We trained multiple models under a broad sweep of hyperparameters (see Table 2)
1 Pith paper cites this work.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
RL's Razor: Why Online Reinforcement Learning Forgets Less