Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
We assessed performance on unrelated benchmarks as described in Section 3.1, using the Language Model Evaluation Harness (Gao et al., 2024)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
RL's Razor: Why Online Reinforcement Learning Forgets Less
Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.