Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
hub
Retaining by doing: The role of on-policy data in mitigating forgetting
13 Pith papers cite this work. Polarity classification is still indexing.
abstract
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
hub tools
citation-role summary
citation-polarity summary
years
2026 13roles
method 2polarities
use method 2representative citing papers
Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
Later-domain RL training harms earlier domains via second-order damage concentrated in a low-dimensional shared conflict subspace; brief domain refresh contracts this component to enable selective recovery.
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
Anchored Learning stabilizes LLM supervised fine-tuning by interpolating a moving anchor between the current model and a frozen reference to create bounded local updates in distribution space.
Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's
MOTAB is a new distillation pipeline that monitors on-policy student trajectories and backtracks with teacher intervention to mitigate dual exposure biases, improving reasoning performance by about 3%.
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
CRAFT is a continual learning method for LLMs that learns low-rank interventions on hidden representations, using a unified KL-divergence objective to handle task routing by output divergence, forgetting control via prior-state regularization, and intervention merging.
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
citing papers explorer
No citing papers match the current filters.