Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

2026 · cs.LG · arXiv 2605.28860

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. Our code is available at https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

representative citing papers

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

cs.LG · 2026-06-09 · unverdicted · novelty 5.0

Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching transfer ratios up to 0.61.
