arxiv: 2308.08747 · v5 · submitted 2023-08-17 · 💻 cs.CL

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang

Pith reviewed 2026-05-18 08:22 UTC · model grok-4.3

keywords catastrophic forgetting · large language models · continual instruction tuning · model scale · knowledge retention · language biases · fine-tuning

The pith

Catastrophic forgetting occurs in LLMs during continual instruction tuning and grows more severe as models scale from 1B to 7B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models lose earlier knowledge when they are fine-tuned sequentially on new instruction sets. Experiments on models from 1 billion to 7 billion parameters show that performance drops on prior tasks involving domain knowledge, reasoning, and reading comprehension. The drop becomes steeper in bigger models, which the authors link to those models starting from higher baseline scores. Decoder-only architectures retain more than encoder-decoder ones, and running general instruction tuning first reduces the amount of forgetting seen in later stages.

Core claim

The authors show through direct measurement that catastrophic forgetting is generally observed in LLMs ranging from 1B to 7B parameters during continual instruction tuning. Severity increases with model scale in this range, possibly because larger models begin with stronger initial performance. BLOOMZ exhibits less forgetting than mT0. LLMs can also reduce language biases such as gender bias during the process, and general instruction tuning beforehand helps limit forgetting in subsequent fine-tuning.

What carries the argument

Continual instruction tuning followed by repeated evaluation of retained accuracy on domain-knowledge, reasoning, and reading-comprehension benchmarks.
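The protocol described here, sequential tuning with the same benchmarks re-measured after every stage, can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `continual_tuning_trace`, `fine_tune`, and `evaluate` are hypothetical names, and the toy stand-ins at the bottom only count tuning stages so that the loop can run without a real model.

```python
from typing import Callable, Dict, List, TypeVar

Model = TypeVar("Model")


def continual_tuning_trace(
    model: Model,
    tuning_tasks: List[str],
    fine_tune: Callable[[Model, str], Model],
    evaluate: Callable[[Model], Dict[str, float]],
) -> List[Dict[str, float]]:
    """Evaluate before tuning and again after each sequential tuning stage."""
    trace = [evaluate(model)]            # stage 0: the initial model
    for task in tuning_tasks:
        model = fine_tune(model, task)   # tune on the next instruction task only
        trace.append(evaluate(model))    # re-measure the same fixed benchmarks
    return trace


# Toy stand-ins (assumptions for illustration): the "model" is just the tuple
# of tasks seen so far, and "evaluation" reports how many stages were applied.
def toy_fine_tune(model: tuple, task: str) -> tuple:
    return model + (task,)


def toy_evaluate(model: tuple) -> Dict[str, float]:
    return {"stages_seen": float(len(model))}


trace = continual_tuning_trace((), ["task_a", "task_b"], toy_fine_tune, toy_evaluate)
print(trace)
```

In a real run, `evaluate` would score the domain-knowledge, reasoning, and reading-comprehension benchmarks, and forgetting would be read off by comparing each later entry of the trace with the first.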

If this is right

Forgetting appears across the full 1B–7B size range tested.
Within this range, larger models lose more of their prior capabilities.
The decoder-only BLOOMZ keeps more knowledge than the encoder-decoder mT0.
The same tuning process can reduce certain language biases.
Running broad instruction tuning before specialized steps limits later forgetting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scale trend holds, methods that protect early knowledge may become more important as models grow beyond 7B.
The bias-reduction observation raises the possibility that continual tuning can be used deliberately to correct unwanted behaviors learned in pre-training.
Interleaving general and domain-specific tuning stages might be tested as a simple way to preserve broad competence.

Load-bearing premise

The chosen tasks for domain knowledge, reasoning, and reading comprehension give an unbiased picture of retained knowledge that is not heavily shaped by the exact order or content of the new tuning data.

What would settle it

An experiment in which performance on the original tasks stays flat or rises after the model is fine-tuned on new tasks, or in which forgetting lessens rather than increases as model size grows from 1B to 7B.

Original abstract

Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information while acquiring new knowledge for achieving a satisfactory performance in downstream tasks. As large language models (LLMs) have demonstrated remarkable performance, it is intriguing to investigate whether CF exists during the continual instruction tuning of LLMs. This study empirically evaluates the forgetting phenomenon in LLMs' knowledge during continual instruction tuning from the perspectives of domain knowledge, reasoning, and reading comprehension. The experiments reveal that catastrophic forgetting is generally observed in LLMs ranging from 1b to 7b parameters. Surprisingly, as the model scale increases, the severity of forgetting intensifies in such a model scale range which may result from the much significant initial performance in the larger LLM. Comparing the decoder-only model BLOOMZ with the encoder-decoder model mT0, BLOOMZ exhibits less forgetting and retains more knowledge. Interestingly, we also observe that LLMs can mitigate language biases, such as gender bias, during continual fine-tuning. Furthermore, our findings indicate that general instruction tuning can help alleviate the forgetting phenomenon in LLMs during subsequent fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Forgetting worsens with scale from 1B to 7B, but the decoder-only BLOOMZ retains more than the encoder-decoder mT0 during continual instruction tuning.

The key thing to know is that this study finds catastrophic forgetting in LLMs during continual instruction tuning, with severity increasing as models scale from 1B to 7B parameters, possibly because larger ones start with higher performance. Decoder-only models forget less than encoder-decoder models. What the paper does well is run a set of experiments comparing different sizes and two architectures on tasks covering domain knowledge, reasoning, and reading comprehension. The results show consistent forgetting, plus some positive notes on bias reduction and the benefit of general instruction tuning beforehand. These are practical findings that add to the literature on how LLMs behave under sequential updates. The soft spots are around the measurement of forgetting. The stress on absolute drops could explain the scale trend without needing a deeper architectural reason, and the abstract does not make clear if they used relative retention or controlled for baselines. Details on datasets, variance, and task difficulty balancing are also thin, which makes the support for the main claims only moderate. The work is empirical and observational, with no circularity in derivations. It engages with prior forgetting research by focusing on instruction tuning specifics. This paper is for people in the continual learning community who want empirical benchmarks for LLMs. A reader looking for data on scale and architecture effects would find it relevant. It deserves a serious referee because the observations are new and could influence how people set up fine-tuning pipelines, even if the current evidence needs strengthening on the metrics side.

Referee Report

1 major / 2 minor

Summary. The paper empirically examines catastrophic forgetting (CF) during continual instruction tuning of LLMs. It evaluates models ranging from 1B to 7B parameters across tasks in domain knowledge, reasoning, and reading comprehension. The central findings are that CF occurs generally in this scale range, that forgetting severity increases with model scale (attributed to higher initial performance in larger models), that the decoder-only BLOOMZ exhibits less forgetting than the encoder-decoder mT0, that continual fine-tuning can mitigate certain language biases, and that prior general instruction tuning reduces subsequent forgetting.

Significance. If the scale-dependent forgetting claim survives controls for baseline performance and task difficulty, the work would provide useful empirical grounding for continual-learning strategies in LLMs. The multi-perspective evaluation (knowledge, reasoning, comprehension) and the architectural comparison are strengths; the observation that general instruction tuning alleviates later forgetting is practically relevant. The study is purely empirical with no parameter-free derivations or machine-checked proofs.

major comments (1)

[Abstract / Results section] Abstract and experimental results: the claim that 'as the model scale increases, the severity of forgetting intensifies' is load-bearing for the paper's main contribution. The manuscript reports absolute performance deltas on fixed downstream tasks without indicating use of normalized retention metrics (e.g., retained fraction post/pre, or relative change (post - pre)/pre) or regression controls for initial performance. Larger models start with higher baselines, so larger absolute drops can occur even under identical relative forgetting rates; this confound must be ruled out before the scale effect can be asserted.
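The confound can be made concrete with a small worked example. The sketch below computes an absolute drop alongside two baseline-normalized quantities, the retained fraction post/pre and the relative change (post - pre)/pre. The function name `forgetting_metrics` and all scores are hypothetical and chosen for illustration; none of the numbers come from the paper.

```python
def forgetting_metrics(pre: float, post: float) -> dict:
    """Absolute and baseline-normalized forgetting for one model on one task."""
    if pre <= 0:
        raise ValueError("pre-tuning score must be positive")
    return {
        "absolute_drop": pre - post,             # points lost
        "relative_change": (post - pre) / pre,   # negative means forgetting
        "retained_fraction": post / pre,         # 1.0 means nothing was lost
    }


# Hypothetical accuracy scores (illustrative only, not results from the paper).
small = forgetting_metrics(pre=30.0, post=24.0)
large = forgetting_metrics(pre=50.0, post=40.0)

for name, m in (("small", small), ("large", large)):
    print(name, m)
```

Both hypothetical models retain 80% of their baseline, yet the one with the higher baseline loses 10 points against 6 for the other. Reported as absolute deltas alone, that would look like a scale effect even though relative forgetting is identical.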

minor comments (2)

[Experimental setup] The manuscript should report exact dataset names, sizes, and task-difficulty controls for the domain-knowledge, reasoning, and reading-comprehension evaluations, along with error bars or statistical significance tests on all forgetting deltas.
[Method] Clarify the precise continual fine-tuning sequence and data composition used for the 1B–7B models so that readers can assess whether task ordering or content overlap could drive the observed patterns.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding the interpretation of scale-dependent forgetting below.

Referee: [Abstract / Results section] Abstract and experimental results: the claim that 'as the model scale increases, the severity of forgetting intensifies' is load-bearing for the paper's main contribution. The manuscript reports absolute performance deltas on fixed downstream tasks without indicating use of normalized retention metrics (e.g., retained fraction post/pre, or relative change (post - pre)/pre) or regression controls for initial performance. Larger models start with higher baselines, so larger absolute drops can occur even under identical relative forgetting rates; this confound must be ruled out before the scale effect can be asserted.

Authors: We acknowledge this valid concern about potential confounding between model scale and baseline performance. While our manuscript already notes that the observed increase in forgetting severity 'may result from the much significant initial performance in the larger LLM', we agree that absolute deltas alone are insufficient to fully substantiate the claim. In the revised version, we will add analyses using normalized retention metrics (e.g., retained fraction = post/pre, or relative change = (post - pre)/pre) across the tasks. We will also include regression controls for initial performance to isolate the effect of scale. These additions will clarify whether the severity indeed intensifies with scale beyond what is expected from higher baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

This paper conducts an empirical study measuring catastrophic forgetting via performance deltas on downstream tasks across model scales (1B-7B). There are no equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce claims to inputs by construction. All reported observations (e.g., scale-dependent forgetting severity, model architecture comparisons) are direct experimental results against external benchmarks, with the analysis self-contained and independent of any internal redefinitions or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This empirical study introduces no free parameters, mathematical axioms, or invented entities; all claims rest on direct experimental measurements of model performance before and after fine-tuning steps.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
cs.CL 2026-05 unverdicted novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
cs.CL 2026-05 unverdicted novelty 7.0

FLAS learns a multi-step velocity field v_t(h,t,c) to steer activations, outperforming prompting with harmonic means of 1.015 and 1.113 on two Gemma models without per-concept tuning.
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
cs.LG 2026-04 unverdicted novelty 7.0

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
cs.AI 2026-04 unverdicted novelty 7.0

Task-Feature Specialization explains weight disentanglement in task arithmetic and leads to orthogonality, which OrthoReg enforces to enhance performance of model composition methods.
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
cs.CL 2026-04 unverdicted novelty 7.0

MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...
SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
cs.LG 2026-03 unverdicted novelty 7.0

SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
cs.RO 2026-03 conditional novelty 7.0

Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
cs.AI 2026-05 unverdicted novelty 6.0

Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
cs.RO 2026-04 unverdicted novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
Routing-Based Continual Learning for Multimodal Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs
cs.CL 2025-10 unverdicted novelty 6.0

Masked fine-tuning enables autoregressive LLMs to inject new factual knowledge without paraphrases and with reversal-curse resistance, matching diffusion LLM advantages on QA tasks.
Improving Sparse Memory Finetuning
cs.LG 2026-04 unverdicted novelty 4.0

Sparse memory modules with KL-based surprising-token selection let retrofitted LLMs acquire new factual knowledge while largely preserving held-out capabilities.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
cs.CR 2024-07 accept novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
cs.CL 2025-09 unverdicted novelty 3.0

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
