Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
An exponential function was fit to the filtered points to produce the trade-off curves.
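As a rough illustration of what fitting an exponential to a set of points can look like, here is a minimal sketch. The source does not state the functional form, the filtering rule, or the fitting procedure, so everything here is an assumption: the form `y = a * exp(b * x)`, the log-linear least-squares method, the helper name `fit_exponential`, and the synthetic data, which are not results from the paper.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by ordinary least squares on log(y).

    Hypothetical helper: the functional form and the fitting method are
    assumptions, not the procedure used in the cited work.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    if any(y <= 0 for y in ys):
        raise ValueError("log-linear fitting requires strictly positive y")

    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n

    # Slope and intercept of the best-fit line through (x, log y).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))

    b = sxl / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


# Synthetic, noise-free points generated from a = 2.0, b = -0.8
# (illustrative values only).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-0.8 * x) for x in xs]

a, b = fit_exponential(xs, ys)
curve = [a * math.exp(b * x) for x in xs]  # fitted curve at the sample points
```

Fitting in log space keeps the sketch dependency-free, but it weights relative rather than absolute errors. A nonlinear least-squares fit on the raw values would be the alternative if absolute error matters more.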
Cited by 1 Pith paper (field: cs.LG; year: 2025; verdict: unverdicted):
RL's Razor: Why Online Reinforcement Learning Forgets Less