Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Changyi Ma; Dong Yi; Fei Zhu; Gaofeng Meng; Haohan Zhao; Hongbin Liu; Hongbo Zhao; Qingfu Zhang; Rong Feng; Song Lai

arxiv: 2507.05386 · v6 · pith:TUKZJAADnew · submitted 2025-07-07 · 💻 cs.LG · cs.AI · cs.CL

Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

classification 💻 cs.LG cs.AI cs.CL

keywords knowledge model post-training tasks continual fine-tuning learning like

Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to ever-evolving downstream tasks. While existing research primarily focuses on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted across multiple multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks, while SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. We investigate RFT's learning dynamics and find that its selective update mechanism inherently prevents interference with established knowledge. Based on this insight, we propose a rollout-based instance filtering algorithm (RIF-RFT) that enhances the training efficiency of RFT by focusing on learnable samples. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
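
The abstract describes RIF-RFT only at a high level: rollouts are used to pick out learnable samples before reinforcement fine-tuning. The sketch below is one plausible reading of that idea, not the authors' implementation. The function names, the success threshold, and the keep-if-any-rollout-succeeds rule are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def filter_learnable(
    instances: Sequence[T],
    rollout_reward: Callable[[T], float],
    num_rollouts: int = 8,
    reward_threshold: float = 1.0,
    min_successes: int = 1,
) -> List[T]:
    """Keep instances whose sampled rollouts succeed often enough.

    Hypothetical rule (an assumption, not taken from the paper): an
    instance counts as learnable if at least `min_successes` of
    `num_rollouts` rollouts reach `reward_threshold`.
    """
    kept: List[T] = []
    for inst in instances:
        # Each call stands in for one policy rollout scored by a reward function.
        successes = sum(
            rollout_reward(inst) >= reward_threshold for _ in range(num_rollouts)
        )
        if successes >= min_successes:
            kept.append(inst)
    return kept


def toy_rollout_reward(success_prob: float) -> float:
    """Toy stand-in for 'sample a response and score it': a Bernoulli reward."""
    return 1.0 if random.random() < success_prob else 0.0


if __name__ == "__main__":
    # Instances are represented only by their (made-up) success probability.
    pool = [0.0, 0.05, 0.5, 1.0]
    print(filter_learnable(pool, toy_rollout_reward))
```

In this toy setup an instance with success probability 0.0 is always dropped and one with probability 1.0 is always kept; instances in between are kept or dropped depending on the sampled rollouts. In a real pipeline, `rollout_reward` would wrap model generation plus a task-specific reward check.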

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
cs.RO 2026-05 unverdicted novelty 7.0

ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...
Rotation-Preserving Supervised Fine-Tuning
cs.LG 2026-05 unverdicted novelty 6.0

RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning
cs.LG 2026-05 unverdicted novelty 6.0

CRAFT is a continual learning method for LLMs that applies low-rank interventions on hidden states, unified by KL divergence for routing similar tasks, regularizing against forgetting, and merging updates, showing red...
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
cs.RO 2026-02 unverdicted novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...
RL's Razor: Why Online Reinforcement Learning Forgets Less
cs.LG 2025-09 unverdicted novelty 6.0

Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
cs.RO 2026-05 unverdicted novelty 5.0

ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.
CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning
cs.LG 2026-05 unverdicted novelty 5.0

CRAFT is a continual learning method for LLMs that learns low-rank interventions on hidden representations, using a unified KL-divergence objective to handle task routing by output divergence, forgetting control via p...