pith. machine review for the scientific record.

arxiv: 2509.04259 · v1 · submitted 2025-09-04 · 💻 cs.LG

Recognition: no theorem link

RL's Razor: Why Online Reinforcement Learning Forgets Less

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 04:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · supervised fine-tuning · catastrophic forgetting · KL divergence · fine-tuning · language models · robotic models · continual learning

The pith

On-policy RL forgets less than SFT because it selects the minimal-KL solution to new tasks among many possibilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that online reinforcement learning preserves prior knowledge better than supervised fine-tuning when adapting models to new tasks, even when both achieve similar performance on the new task. This occurs because RL is biased toward solutions that stay close in distribution to the base model, as measured by KL divergence evaluated on the new-task data. A sympathetic reader would care because catastrophic forgetting is a major obstacle to building continually improving AI systems that retain old skills while learning new ones. If true, this suggests that RL should be the preferred fine-tuning method in settings where maintaining broad capabilities matters.

Core claim

The central claim is that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. The degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. This principle, termed RL's Razor, is validated through experiments with large language models and robotic foundation models, with theoretical justification for why on-policy RL updates lead to a smaller KL change.
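To make the measured quantity concrete, the sketch below is a minimal Monte Carlo estimator of the divergence the claim turns on, read here as KL(pi_ft || pi_0) with completions sampled from the fine-tuned policy on new-task inputs. The toy per-token categorical policies, the sample sizes, and the independent-token simplification are assumptions of this sketch, not the paper's setup; with real models the same estimator would use prompt- and prefix-conditioned token log-probabilities.

```python
# Minimal sketch (toy policies, not the paper's code): Monte Carlo estimate of
# KL(pi_ft || pi_0), the quantity said to predict forgetting, using completions
# sampled from the fine-tuned policy on a new-task prompt.
import torch

torch.manual_seed(0)
vocab, seq_len, n_samples = 16, 8, 256

# Toy stand-ins for a single prompt: per-token categorical policies.
base_logits = torch.randn(vocab)                     # frozen base policy pi_0
ft_logits = base_logits + 0.5 * torch.randn(vocab)   # fine-tuned policy pi_ft

# Sample completions from the fine-tuned policy.
probs_ft = torch.softmax(ft_logits, dim=-1)
tokens = torch.multinomial(probs_ft, n_samples * seq_len, replacement=True)
tokens = tokens.view(n_samples, seq_len)

# Score the same completions under both policies.
logp_ft = torch.log_softmax(ft_logits, dim=-1)[tokens].sum(dim=-1)
logp_base = torch.log_softmax(base_logits, dim=-1)[tokens].sum(dim=-1)

# KL(pi_ft || pi_0) = E_{y ~ pi_ft}[ log pi_ft(y|x) - log pi_0(y|x) ]
kl_estimate = (logp_ft - logp_base).mean()
print(f"estimated KL(pi_ft || pi_0) per completion: {kl_estimate.item():.3f}")
```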

What carries the argument

The implicit bias in on-policy RL updates toward minimizing the KL divergence from the base policy while solving the new task.
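A worked contrast of the two update rules makes the claimed mechanism visible. The notation below is a standard reconstruction rather than the paper's own: pi_theta is the policy being trained (initialized at the base policy pi_0), D_new the new-task data, y* a supervised label, and A(x, y) an advantage or reward signal.

```latex
% Hedged reconstruction in standard notation (not copied from the paper):
% SFT follows gradients toward external labels, wherever they lie;
% on-policy RL only reweights completions drawn from the current policy.
\begin{align}
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  &= -\,\mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}_{\mathrm{new}}}
      \left[ \nabla_\theta \log \pi_\theta(y^{*} \mid x) \right] \\
\nabla_\theta J_{\mathrm{RL}}
  &= \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{new}},\; y \sim \pi_\theta(\cdot \mid x)}
      \left[ A(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
\end{align}
```

Because the RL expectation is taken under pi_theta itself, completions the current policy does not produce contribute no signal, so training that starts at pi_0 tends to stay in its KL neighborhood; SFT has no such anchoring unless a KL penalty is added explicitly.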

If this is right

  • RL fine-tuned models show significantly better retention of prior capabilities compared to SFT models with similar new-task performance.
  • The amount of forgetting can be predicted from the KL divergence between the fine-tuned policy and the base policy on the new task.
  • Theoretical analysis shows that on-policy RL updates naturally lead to smaller changes in KL than SFT.
  • Experiments confirm the pattern holds for both language models and robotic foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could imply that RL is particularly useful for sequential task learning where models must accumulate skills without erasing earlier ones.
  • Practitioners might prefer RL over SFT when the base model has valuable general knowledge that should be preserved.
  • Future work could explore hybrid methods that combine SFT's speed with RL's stability by adding KL regularization to SFT.
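On the last extension above: a minimal sketch of what KL-regularized SFT could look like, assuming logits are available from both the trainable model and a frozen copy of the base model. This is an editorial illustration, not a method from the paper; the coefficient beta and the tensor shapes are hypothetical.

```python
# Editorial sketch (not from the paper): SFT cross-entropy plus an explicit
# token-level KL(pi_theta || pi_0) penalty toward a frozen base model,
# computed on the same new-task inputs. beta is a hypothetical coefficient.
import torch
import torch.nn.functional as F

def kl_regularized_sft_loss(student_logits, base_logits, target_ids, beta=0.1):
    """student_logits, base_logits: [batch, seq, vocab]; target_ids: [batch, seq]."""
    # Standard SFT term: cross-entropy against the new-task labels.
    ce = F.cross_entropy(student_logits.flatten(0, 1), target_ids.flatten())
    # Penalty term: KL between the trainable and frozen token distributions.
    logp_student = F.log_softmax(student_logits, dim=-1)
    logp_base = F.log_softmax(base_logits, dim=-1)
    kl = (logp_student.exp() * (logp_student - logp_base)).sum(-1).mean()
    return ce + beta * kl

# Toy usage with random tensors standing in for model outputs.
B, T, V = 2, 5, 11
loss = kl_regularized_sft_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                               torch.randint(0, V, (B, T)))
print(loss.item())
```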

Load-bearing premise

That the observed degree of forgetting is determined by the KL divergence between the fine-tuned and base policy, evaluated on the new task.

What would settle it

An observation that a high-KL fine-tuned model forgets less than a low-KL one on the same task, or that RL sometimes settles on high-KL solutions when equally good lower-KL alternatives exist; either would undercut the razor.
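One way to operationalize that test, sketched below with made-up numbers: among fine-tuned models matched on new-task accuracy, forgetting should rise monotonically with measured KL. A flat or inverted rank correlation, or a single clear high-KL, low-forgetting counterexample on the same task, would count against the claim.

```python
# Sketch of the falsification check (illustrative numbers only): across models
# matched on new-task accuracy, rank-correlate measured KL against forgetting.
from scipy.stats import spearmanr

# Hypothetical per-model measurements: KL(pi_ft || pi_0) on the new task, and
# forgetting = drop in prior-task accuracy (percentage points).
kl_values  = [0.05, 0.12, 0.30, 0.55, 0.90]
forgetting = [0.4,  1.1,  2.9,  5.2,  8.7]

rho, p = spearmanr(kl_values, forgetting)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
# The paper's claim predicts rho close to +1; a high-KL model that forgets
# little (or a low-KL model that forgets a lot) would be the counterexample
# described above.
```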

read the original abstract

Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance at a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL-divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle $\textit{RL's Razor}$: among all ways to solve a new task, RL prefers those closest in KL to the original model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that online RL fine-tuning preserves prior knowledge better than SFT at comparable new-task performance because on-policy RL is implicitly biased toward solutions minimizing KL(fine-tuned || base) on the new-task distribution, while SFT can drift arbitrarily far; this 'RL's Razor' is supported by LLM and robotic experiments plus a theoretical justification that on-policy updates produce smaller KL change.

Significance. If substantiated, the result offers a principled account of why RL exhibits less catastrophic forgetting than SFT in continual fine-tuning of large models, with direct implications for method selection in LLMs and robotics. The dual validation across language and robotic domains plus the attempt at a theoretical derivation are strengths that would elevate the work above purely empirical comparisons.

major comments (2)
  1. [Abstract and theoretical justification] The central claim that forgetting is governed by KL(fine-tuned || base) evaluated on the new task is load-bearing, yet the manuscript does not demonstrate that this new-task KL controls effective shift on the original data distribution when supports overlap only partially. Additional analysis or controls are needed to rule out that optimization dynamics still induce large changes on prior-task regions.
  2. [Experiments] Experimental section: while performance on the new task is reported as comparable, the manuscript should explicitly tabulate the measured KL values alongside forgetting metrics for each method and task to allow direct verification of the claimed correlation; without these numbers the empirical support for the KL-bias mechanism remains indirect.
minor comments (2)
  1. Notation for the base policy and fine-tuned policy should be introduced once and used consistently; occasional switches between π_0 and π_base create unnecessary ambiguity.
  2. Figure captions should state the exact number of runs and error bars used so readers can assess statistical reliability of the reported forgetting differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and have revised the manuscript accordingly to strengthen the presentation of our central claims.

read point-by-point responses
  1. Referee: [Abstract and theoretical justification] The central claim that forgetting is governed by KL(fine-tuned || base) evaluated on the new task is load-bearing, yet the manuscript does not demonstrate that this new-task KL controls effective shift on the original data distribution when supports overlap only partially. Additional analysis or controls are needed to rule out that optimization dynamics still induce large changes on prior-task regions.

    Authors: We appreciate the referee's emphasis on the need to connect new-task KL to shifts on the original distribution under partial support overlap. Our theoretical analysis shows that on-policy updates minimize KL with respect to the sampling distribution used during training (the new task), and the empirical results across LLMs and robotics demonstrate that this bias correlates with lower forgetting on prior tasks. To directly address the concern about optimization dynamics on prior regions, the revised manuscript includes additional analysis measuring KL divergence on original-task distributions and controls for partial-overlap cases, confirming that the on-policy bias limits extraneous drift. revision: partial

  2. Referee: [Experiments] Experimental section: while performance on the new task is reported as comparable, the manuscript should explicitly tabulate the measured KL values alongside forgetting metrics for each method and task to allow direct verification of the claimed correlation; without these numbers the empirical support for the KL-bias mechanism remains indirect.

    Authors: We agree that tabulating the KL values would make the empirical support more direct. The revised manuscript now includes a dedicated table reporting the measured KL(fine-tuned || base) on the new-task distribution for every method and task, placed alongside the forgetting metrics. This allows readers to verify the correlation between lower KL and reduced forgetting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation follows from on-policy update mechanics

full rationale

The paper derives the bias toward KL-minimal solutions directly from the local nature of on-policy gradient updates, which constrain policy changes without requiring fitted parameters or self-referential definitions. The claim that forgetting correlates with KL on the new-task distribution is supported by both empirical measurements across LLMs and robotic models and a theoretical argument based on the update rule itself, rather than by renaming or importing uniqueness from prior self-citations. No step reduces the central result to its inputs by construction; the analysis remains self-contained under the stated assumptions about on-policy dynamics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that forgetting is governed by KL shift and that on-policy RL updates inherently minimize this shift; no explicit free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Properties of KL divergence and on-policy policy gradient updates determine the magnitude of distributional shift
    Invoked to explain why RL produces smaller KL change than SFT

pith-pipeline@v0.9.0 · 5448 in / 1130 out tokens · 23236 ms · 2026-05-17T04:58:40.180049+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  3. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  4. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  5. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  6. How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

    cs.CL 2026-03 conditional novelty 7.0

    TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

  7. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  8. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  9. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  10. CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    CRAFT is a continual learning method for LLMs that applies low-rank interventions on hidden states, unified by KL divergence for routing similar tasks, regularizing against forgetting, and merging updates, showing red...

  11. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  12. Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control

    cs.LG 2026-05 unverdicted novelty 6.0

    Anchored Learning stabilizes LLM supervised fine-tuning by interpolating a moving anchor between the current model and a frozen reference to create bounded local updates in distribution space.

  13. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  14. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  15. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  16. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

  17. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  18. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

    cs.LG 2026-05 unverdicted novelty 5.0

    RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.

  19. CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    CRAFT is a continual learning method for LLMs that learns low-rank interventions on hidden representations, using a unified KL-divergence objective to handle task routing by output divergence, forgetting control via p...

  20. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  21. Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

    cs.LG 2025-12 unverdicted novelty 5.0

    Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 17 Pith papers

  1. [1]

     Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, et al. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training. arXiv preprint arXiv:2507.05386, 2025.

  2. [2]

     Hyperparameter sweep. We trained multiple models under a broad sweep of hyperparameters (see Table 2).

  3. [3]

     New-task evaluation. For Math and Science Q&A, accuracy was measured by comparing the model's final answer to the ground truth, ignoring intermediate reasoning chains. For Tool Use, we extracted API calls from the output and matched them against ground-truth calls via regular expressions.

  4. [4]

     Previous-task evaluation. We assessed performance on unrelated benchmarks as described in Section 3.1, using the Language Model Evaluation Harness (Gao et al., 2024).

  5. [5]

     Pareto filtering. From the trained models, we retained only those lying within 2 accuracy points of the Pareto frontier.

  6. [6]

     Curve fitting. An exponential function was fit to the filtered points to produce the trade-off curves. Figure 6: example for the process of creating the Pareto frontier plots. B.2 Robotic experiments: we evaluated the RL–SFT forgetting gap in a robotic control setting using the OpenVLA-7B model (Kim et al., 2024) as our base policy in the SimplerEnv enviro...

  7. [7]

     GRPO + KL regularization with coefficient 0.1.

  8. [8]

     SFT 1: all even digits mapped to label 0, all odd digits to label 1.

  9. [9]

     SFT 2: even digits randomly mapped to {0, 4}, odd digits to {1, 5}.

  10. [10]

     Oracle distribution. SFT with the oracle distribution: annotations drawn from the minimum-KL distribution consistent with task correctness. Motivated by the KL–forgetting connection, we define the oracle distribution as the one that achieves perfect task accuracy while remaining closest (in KL divergence) to the pretraining distribution π0. Concretely, for an... (see the sketch after this list)

  11. [11]

     Forgetting direction. Using the FashionMNIST evaluation set, we calculated the gradient of the loss with respect to model parameters. We then measured the cosine similarity between this gradient and the actual parameter update from the training step. A positive cosine indicates that the update increases FashionMNIST loss (catastrophic forgetting), whi...

  12. [12]

     KL shift. We measured the change in KL divergence between the model's output distributions on the ParityMNIST test set before and after the update. Plotting per-step KL change against the cosine similarity (Figure 10) revealed a strong correlation: steps producing larger KL shifts tended to align more with the forgetting gradient. This analysis demon...