Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation
Pith reviewed 2026-06-27 01:34 UTC · model grok-4.3
The pith
New-task training induces old-task prediction drift through the cross-task kernel, yielding an exact closed-form forgetting vector in the linear-head NTK case.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the NTK regime, new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL the predictor is exact up to numerical precision; for nonlinear adapters it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank.
What carries the argument
The cross-task kernel, which produces the closed-form predictor for the forgetting vector and exposes its low-rank concentration in old-task NTK eigenmodes.
If this is right
- The exact forgetting vector can be computed from the cross-task kernel before any new-task gradient step occurs.
- Forgetting is confined to a low-dimensional subspace spanned by a few eigenmodes of the old-task NTK.
- Parameter-space regularizers can miss the output-space directions where interference actually occurs.
- A spectral regularizer that targets only the vulnerable eigenmodes becomes a natural design choice.
- Under frozen linear heads the rank of the forgetting matrix follows an explicit Kronecker product rule.
Where Pith is reading between the lines
- Adapters could be chosen or trained to minimize cross-task kernel entries and thereby reduce forgetting without replay buffers.
- Continual-learning diagnostics need only track a small number of principal directions rather than the full output space.
- Quantifying the approximation error of the local NTK predictor for nonlinear adapters would turn the theory into a practical bound.
- The same kernel-driven view may illuminate negative transfer or interference in multi-task rather than sequential settings.
Load-bearing premise
The analysis assumes the NTK regime and requires the model to be exactly linear in its trainable parameters for the predictor to be exact rather than approximate.
What would settle it
Train a frozen-backbone linear-head model sequentially on two tasks, compute the cross-task kernel once, form the predicted forgetting vector, and verify whether the observed change in old-task outputs matches that vector to numerical precision.
Figures
read the original abstract
Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a function-space theory of catastrophic forgetting in the NTK regime. New-task training induces old-task prediction drift via the cross-task kernel, yielding a closed-form predictor for the forgetting vector. This predictor is exact (up to numerics) for frozen-backbone linear-head PEFT-CL and a local approximation otherwise. The same expression shows forgetting concentrates in a small number of old-task NTK eigenmodes and yields a Kronecker scaling rule for vulnerable rank under linear heads. The work relates this to prior NTK-overlap theory and motivates spectral regularization.
Significance. If the derivations hold, the closed-form predictor and low-rank characterization provide a precise output-space account of forgetting that explains limitations of parameter-space regularizers and enables targeted interventions. The exactness result in the linear-head case, together with the eigenmode concentration, is a clear strength offering falsifiable predictions and reproducible analysis in the PEFT-CL setting.
major comments (2)
- [§3.2, Eq. (9)] §3.2, Eq. (9): the local NTK approximation for nonlinear adapters and full fine-tuning is stated without an error bound, remainder term, or empirical quantification of linearization error. This is load-bearing for extending the exact linear-head result to the broader PEFT-CL claims in the title and abstract.
- [§4.1, Theorem 2] §4.1, Theorem 2: the claim that forgetting concentrates in a small number of eigenmodes relies on the cross-task kernel structure, but the paper does not report the numerical rank or eigenvalue decay rates across the evaluated tasks to confirm the 'small number' is consistent and not task-dependent.
minor comments (2)
- [§2] Notation for the cross-task kernel K_{12} is introduced without an explicit comparison table to prior NTK-overlap definitions, which would clarify the claimed relation to earlier work.
- [Figure 3] Figure 3 caption does not state the number of random seeds or the precise metric used for the 'forgetting vector' norm, reducing reproducibility of the eigenmode plots.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We address each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [§3.2, Eq. (9)] the local NTK approximation for nonlinear adapters and full fine-tuning is stated without an error bound, remainder term, or empirical quantification of linearization error. This is load-bearing for extending the exact linear-head result to the broader PEFT-CL claims in the title and abstract.
Authors: We agree that the local NTK approximation lacks a formal error bound or remainder term; deriving a non-vacuous bound for finite-width nonlinear adapters remains an open technical challenge beyond the scope of the present work. The exact closed-form result is restricted to the frozen linear-head case, as stated in the abstract and §3. The approximation for nonlinear cases is presented as a local predictor whose validity is supported by the standard NTK linearization. To strengthen the manuscript we will add empirical quantification of the linearization error (comparing the closed-form predictor against observed prediction drift on nonlinear adapters) in the experiments section and will explicitly qualify the title/abstract claims to emphasize the exact linear-head setting while noting the approximation for other regimes. revision: yes
-
Referee: [§4.1, Theorem 2] the claim that forgetting concentrates in a small number of eigenmodes relies on the cross-task kernel structure, but the paper does not report the numerical rank or eigenvalue decay rates across the evaluated tasks to confirm the 'small number' is consistent and not task-dependent.
Authors: We concur that explicit reporting of numerical ranks and eigenvalue decay is needed to substantiate the consistency of the low-rank concentration. The current manuscript illustrates concentration via the closed-form expression and selected visualizations, but does not tabulate effective ranks. We will revise §4.1 and the associated experiments to include tables (or supplementary plots) of eigenvalue spectra and numerical ranks (e.g., count of eigenvalues exceeding 1% of the largest) for every task pair, thereby confirming that the effective rank remains small and stable across the evaluated settings. revision: yes
Circularity Check
No circularity: closed-form predictor derived from standard NTK cross-task kernel without reduction to fitted inputs or self-citation chains
full rationale
The paper's central claim is a closed-form expression for old-task prediction drift induced by new-task training via the cross-task kernel in the NTK regime. This is presented as following directly from the kernel definition under the stated linearity assumption (frozen backbone + linear head), with the nonlinear case explicitly labeled a local approximation. No equations or text in the provided abstract reduce the predictor to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation. The derivation is framed as a direct consequence of the NTK framework applied to the continual adaptation setting, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption NTK regime approximation holds for the model behavior during new-task training
Reference graph
Works this paper leans on
-
[1]
S., Rolnick, D., and Kording, K
Benjamin, A. S., Rolnick, D., and Kording, K. Measuring and regularizing networks in function space. ICLR, 2019
2019
-
[2]
Bennani, M. A., Doan, T., and Sugiyama, M. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv preprint arXiv:2006.11942, 2020
-
[3]
Dark experience for general continual learning: a strong, simple baseline
Buzzega, P., Boschini, M., Porrello, A., Abati, D., and Calderara, S. Dark experience for general continual learning: a strong, simple baseline. NeurIPS, 2020
2020
-
[4]
A theoretical analysis of catastrophic forgetting through the NTK overlap matrix
Doan, T., Abbana Bennani, M., Mazoure, B., Rabusseau, G., and Alquier, P. A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. AISTATS, 2021
2021
- [5]
-
[6]
and Li, W.-J
Liang, Y.-S. and Li, W.-J. InfLoRA: Interference-free low-rank adaptation for continual learning. CVPR, pp.\ 23638-23647, 2024
2024
-
[7]
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
S., Karlinsky, L., Gutta, V., Cascante-Bonilla, P., Kim, D., Arbelle, A., Panda, R., Feris, R., and Kira, Z
Smith, J. S., Karlinsky, L., Gutta, V., Cascante-Bonilla, P., Kim, D., Arbelle, A., Panda, R., Feris, R., and Kira, Z. CODA-Prompt: COntinual decomposed attention-based prompting for rehearsal-free continual learning. CVPR, 2023
2023
-
[9]
DualPrompt: Complementary prompting for rehearsal-free continual learning
Wang, Z., Zhang, Z., Ebrahimi, S., Sun, R., Zhang, H., Lee, C.-Y., Ren, X., Su, G., Perot, V., Dy, J., and Pfister, T. DualPrompt: Complementary prompting for rehearsal-free continual learning. ECCV, 2022
2022
-
[10]
Learning to prompt for continual learning
Wang, Z., Zhang, Z., Lee, C.-Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., and Pfister, T. Learning to prompt for continual learning. CVPR, 2022
2022
-
[11]
arXiv preprint arXiv:2404.16789 , year=
Wang, H., Lu, H., Yao, L., and Gong, D. Continual learning of large language models: A comprehensive survey. arXiv preprint arXiv:2404.16789, 2024
-
[12]
Orthogonal gradient descent for continual learning
Farajtabar, M., Azizan, N., Mott, A., and Li, A. Orthogonal gradient descent for continual learning. AISTATS, 2020
2020
-
[13]
Neural tangent kernel: Convergence and generalization in neural networks
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS, 2018
2018
-
[14]
Overcoming catastrophic forgetting in neural networks
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521-3526, 2017
2017
-
[15]
and Hoiem, D
Li, Z. and Hoiem, D. Learning without forgetting. TPAMI, 40(12):2935-2947, 2017
2017
-
[16]
On the spectral bias of neural networks
Rahaman, N., Baratin, A., Arpit, D., et al. On the spectral bias of neural networks. ICML, 2019
2019
-
[17]
Experience replay for continual learning
Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. NeurIPS, 2019
2019
-
[18]
Gradient projection memory for continual learning
Saha, G., Garg, I., and Roy, K. Gradient projection memory for continual learning. ICLR, 2021
2021
-
[19]
K., Schwarz, J., Matthews, A
Titsias, M. K., Schwarz, J., Matthews, A. G., Pascanu, R., and Teh, Y. W. Functional regularisation for continual learning using Gaussian processes. ICLR, 2020
2020
-
[20]
Continual learning through synaptic intelligence
Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. ICML, 2017
2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.