An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Pith reviewed 2026-05-18 08:22 UTC · model grok-4.3
The pith
Catastrophic forgetting occurs in LLMs during continual instruction tuning and grows more severe as models scale from 1B to 7B parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show through direct measurement that catastrophic forgetting is generally observed in LLMs ranging from 1B to 7B parameters during continual instruction tuning. Severity increases with model scale in this range, possibly because larger models begin with stronger initial performance. BLOOMZ exhibits less forgetting than mT0. LLMs can also reduce language biases such as gender bias during the process, and general instruction tuning beforehand helps limit forgetting in subsequent fine-tuning.
What carries the argument
Continual instruction tuning followed by repeated evaluation of retained accuracy on domain-knowledge, reasoning, and reading-comprehension benchmarks.
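A minimal sketch of this protocol, under stated assumptions: `track_retention`, `fine_tune`, `evaluate`, the task names, and the single-float "model" are hypothetical placeholders for illustration, not the authors' code.

```python
from typing import Any, Callable, Dict, List, Sequence


def track_retention(
    model: Any,
    tuning_tasks: Sequence[str],
    benchmarks: Sequence[str],
    fine_tune: Callable[[Any, str], Any],
    evaluate: Callable[[Any, str], float],
) -> List[Dict[str, float]]:
    """Score every benchmark before tuning and again after each tuning stage."""
    history = [{name: evaluate(model, name) for name in benchmarks}]
    for task in tuning_tasks:
        model = fine_tune(model, task)  # sequential: each stage starts from the last
        history.append({name: evaluate(model, name) for name in benchmarks})
    return history


# Toy stand-ins (assumptions): the "model" is a single float, and every
# tuning stage shrinks it by 10% to mimic forgetting.
toy_fine_tune = lambda m, task: m * 0.9
toy_evaluate = lambda m, benchmark: m

history = track_retention(
    model=0.8,
    tuning_tasks=["task_a", "task_b"],
    benchmarks=["domain_knowledge", "reasoning", "reading_comprehension"],
    fine_tune=toy_fine_tune,
    evaluate=toy_evaluate,
)
for stage, scores in enumerate(history):
    print(stage, scores)
```

Keeping the full per-stage history, rather than only the first and last scores, makes it possible to see at which tuning stage a capability drops.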
If this is right
- Forgetting appears across the full 1B–7B size range tested.
- Within this range, larger models lose more of their prior capabilities.
- The decoder-only model BLOOMZ keeps more knowledge than the encoder-decoder model mT0.
- The same tuning process can reduce certain language biases.
- Running broad instruction tuning before specialized steps limits later forgetting.
Where Pith is reading between the lines
- If the scale trend holds, methods that protect early knowledge may become more important as models grow beyond 7B.
- The bias-reduction observation raises the possibility that continual tuning can be used deliberately to correct unwanted behaviors learned in pre-training.
- Interleaving general and domain-specific tuning stages might be tested as a simple way to preserve broad competence.
Load-bearing premise
The chosen tasks for domain knowledge, reasoning, and reading comprehension give an unbiased picture of retained knowledge that is not heavily shaped by the exact order or content of the new tuning data.
What would settle it
An experiment in which performance on the original tasks stays flat or rises after the model is fine-tuned on new tasks, or in which forgetting lessens rather than increases as model size grows from 1B to 7B.
Original abstract
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge for achieving a satisfactory performance in downstream tasks. As large language models (LLMs) have demonstrated remarkable performance, it is intriguing to investigate whether CF exists during the continual instruction tuning of LLMs. This study empirically evaluates the forgetting phenomenon in LLMs' knowledge during continual instruction tuning from the perspectives of domain knowledge, reasoning, and reading comprehension. The experiments reveal that catastrophic forgetting is generally observed in LLMs ranging from 1b to 7b parameters. Surprisingly, as the model scale increases, the severity of forgetting intensifies in such a model scale range, which may result from the much significant initial performance in the larger LLM. Comparing the decoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ exhibits less forgetting and retains more knowledge. Interestingly, we also observe that LLMs can mitigate language biases, such as gender bias, during continual fine-tuning. Furthermore, our findings indicate that general instruction tuning can help alleviate the forgetting phenomenon in LLMs during subsequent fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically examines catastrophic forgetting (CF) during continual instruction tuning of LLMs. It evaluates models ranging from 1B to 7B parameters across tasks in domain knowledge, reasoning, and reading comprehension. The central findings are that CF occurs generally in this scale range, that forgetting severity increases with model scale (attributed to higher initial performance in larger models), that the decoder-only BLOOMZ exhibits less forgetting than the encoder-decoder mT0, that continual fine-tuning can mitigate certain language biases, and that prior general instruction tuning reduces subsequent forgetting.
Significance. If the scale-dependent forgetting claim survives controls for baseline performance and task difficulty, the work would provide useful empirical grounding for continual-learning strategies in LLMs. The multi-perspective evaluation (knowledge, reasoning, comprehension) and the architectural comparison are strengths; the observation that general instruction tuning alleviates later forgetting is practically relevant. The study is purely empirical with no parameter-free derivations or machine-checked proofs.
major comments (1)
- [Abstract / Results section] Abstract and experimental results: the claim that 'as the model scale increases, the severity of forgetting intensifies' is load-bearing for the paper's main contribution. The manuscript reports absolute performance deltas on fixed downstream tasks without indicating use of normalized retention metrics (e.g., relative change (post - pre)/pre, or equivalently retained fraction post/pre) or regression controls for initial performance. Larger models start with higher baselines, so larger absolute drops can occur even under identical relative forgetting rates; this confound must be ruled out before the scale effect can be asserted.
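The confound in this comment can be illustrated with a short sketch. The accuracies below are hypothetical numbers chosen only to make the point, and the function names are illustrative rather than taken from the paper.

```python
def absolute_delta(pre: float, post: float) -> float:
    """Absolute change in accuracy (negative means forgetting)."""
    return post - pre


def relative_change(pre: float, post: float) -> float:
    """Change normalized by the initial score: (post - pre) / pre."""
    if pre <= 0:
        raise ValueError("pre must be positive to normalize by it")
    return (post - pre) / pre


# Hypothetical accuracies (assumptions, not results from the paper).
scores = {
    "small": {"pre": 0.30, "post": 0.24},
    "large": {"pre": 0.50, "post": 0.40},
}

for name, s in scores.items():
    print(
        name,
        round(absolute_delta(s["pre"], s["post"]), 3),
        round(relative_change(s["pre"], s["post"]), 3),
    )
```

In this toy case the larger model loses more accuracy in absolute terms, yet both models lose the same fraction of their starting score, which is why absolute deltas alone cannot establish a scale effect.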
minor comments (2)
- [Experimental setup] The manuscript should report exact dataset names, sizes, and task-difficulty controls for the domain-knowledge, reasoning, and reading-comprehension evaluations, along with error bars or statistical significance tests on all forgetting deltas.
- [Method] Clarify the precise continual fine-tuning sequence and data composition used for the 1B–7B models so that readers can assess whether task ordering or content overlap could drive the observed patterns.
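One way the requested error bars could be produced is a paired percentile bootstrap over evaluation examples. The sketch below is an assumption about how such an analysis might look, not something the manuscript describes; `bootstrap_delta_ci` and the 0/1 correctness vectors are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_delta_ci(
    pre_correct: List[int],
    post_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(post) - mean(pre) on paired examples."""
    if len(pre_correct) != len(post_correct) or not pre_correct:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(pre_correct)
    deltas = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pre/post paired.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(post_correct[i] - pre_correct[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-example correctness (1 = correct) before and after tuning.
pre = [1] * 60 + [0] * 40
post = [1] * 45 + [0] * 55
lower, upper = bootstrap_delta_ci(pre, post)
print(lower, upper)
```

An interval that excludes zero would indicate that the measured drop is unlikely to be resampling noise on that benchmark; it says nothing about variance across training seeds, which would need separate runs.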
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding the interpretation of scale-dependent forgetting below.
Point-by-point responses
- Referee: [Abstract / Results section] Abstract and experimental results: the claim that 'as the model scale increases, the severity of forgetting intensifies' is load-bearing for the paper's main contribution. The manuscript reports absolute performance deltas on fixed downstream tasks without indicating use of normalized retention metrics (e.g., relative change (post - pre)/pre, or equivalently retained fraction post/pre) or regression controls for initial performance. Larger models start with higher baselines, so larger absolute drops can occur even under identical relative forgetting rates; this confound must be ruled out before the scale effect can be asserted.
Authors: We acknowledge this valid concern about potential confounding between model scale and baseline performance. While our manuscript already notes that the observed increase in forgetting severity 'may result from the much significant initial performance in the larger LLM', we agree that absolute deltas alone are insufficient to fully substantiate the claim. In the revised version, we will add analyses using normalized retention metrics (e.g., relative change = (post - pre)/pre, or equivalently retained fraction = post/pre) across the tasks. We will also include regression controls for initial performance to isolate the effect of scale. These additions will clarify whether the severity indeed intensifies with scale beyond what is expected from higher baselines. revision: yes
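A regression control of the kind promised here might look like the following sketch: fit the forgetting delta against initial performance, then check whether any scale-related residual remains. The records are hypothetical values, and `fit_line` and `residual_forgetting` are illustrative names, not the authors' analysis.

```python
from typing import Dict, List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary to fit a slope")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def residual_forgetting(
    records: List[Tuple[str, float, float]]
) -> Dict[str, float]:
    """Mean residual of (post - pre) per scale after regressing on pre."""
    pre = [r[1] for r in records]
    delta = [r[2] - r[1] for r in records]
    slope, intercept = fit_line(pre, delta)
    by_scale: Dict[str, List[float]] = {}
    for (label, p, _), d in zip(records, delta):
        by_scale.setdefault(label, []).append(d - (intercept + slope * p))
    return {label: sum(v) / len(v) for label, v in by_scale.items()}


# Hypothetical (scale, pre, post) records in which every model loses
# exactly 20% of its starting score (an assumption for illustration).
records = [
    ("1B", 0.30, 0.24),
    ("1B", 0.40, 0.32),
    ("7B", 0.50, 0.40),
    ("7B", 0.60, 0.48),
]
residuals = residual_forgetting(records)
print(residuals)
```

In this constructed case the residuals are essentially zero for both scales, meaning baseline performance alone accounts for the larger absolute drops. Because scale and baseline are strongly correlated, such a control needs enough model and task variety to separate the two.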
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential reductions
full rationale
This paper conducts an empirical study measuring catastrophic forgetting via performance deltas on downstream tasks across model scales (1B-7B). There are no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce claims to inputs by construction. All reported observations (e.g., scale-dependent forgetting severity, model architecture comparisons) are direct experimental results against external benchmarks, with the analysis self-contained and independent of any internal redefinitions or ansatzes.
Forward citations
Cited by 19 Pith papers
- MeMo: Memory as a Model
  MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
  Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
- Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
  FLAS learns a multi-step velocity field v_t(h,t,c) to steer activations, outperforming prompting with harmonic means of 1.015 and 1.113 on two Gemma models without per-concept tuning.
- SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
  SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...
- Understanding and Enforcing Weight Disentanglement in Task Arithmetic
  Task-Feature Specialization explains weight disentanglement in task arithmetic and leads to orthogonality, which OrthoReg enforces to enhance performance of model composition methods.
- MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
  MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...
- SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
  SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
  Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
  Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...
- PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
  PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
- $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
  M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
- Routing-Based Continual Learning for Multimodal Large Language Models
  Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
- Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs
  Masked fine-tuning enables autoregressive LLMs to inject new factual knowledge without paraphrases and with reversal-curse resistance, matching diffusion LLM advantages on QA tasks.
- Improving Sparse Memory Finetuning
  Sparse memory modules with KL-based surprising-token selection let retrofitted LLMs acquire new factual knowledge while largely preserving held-out capabilities.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
- Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
  Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.