Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation

Dan Raviv; Ido Nitzan Hidekel

arxiv: 2606.18024 · v1 · pith:PC5AQJLRnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 01:34 UTC · model grok-4.3

keywords: catastrophic forgetting, continual learning, NTK regime, function space, low-rank structure, cross-task kernel, PEFT, spectral regularization

The pith

New-task training induces old-task prediction drift through the cross-task kernel, yielding an exact closed-form forgetting vector in the linear-head NTK case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies a function-space account of catastrophic forgetting inside the NTK regime. Training on a new task produces a predictable drift in old-task outputs that is fully determined by the cross-task kernel, and this drift can be written down before any gradient step is taken. When the backbone is frozen and the head is linear, the expression is exact; otherwise it remains a local approximation. The same formula shows that the drift lives in only a few eigenmodes of the old-task NTK and obeys a Kronecker scaling rule for its rank under linear heads.

Core claim

In the NTK regime, new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL the predictor is exact up to numerical precision; for nonlinear adapters it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank.
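
One way to see where a closed form of this kind can come from is the standard kernel-regression calculation below. It is a sketch under assumptions that are not taken from the paper: a scalar-output model that is linear in its trainable parameters, $f_\theta(x) = \phi(x)^\top \theta$ with frozen features $\phi$, squared loss on the new task B, full-batch gradient descent with step size $\eta$ for $T$ steps, and an invertible $K_{BB}$. The symbols $\Phi_A$, $\Phi_B$, $r_0$, $\eta$, and $T$ are introduced here for illustration; the paper's own statement may be parameterized differently.

```latex
\begin{aligned}
K_{AB} &= \Phi_A \Phi_B^\top, \qquad
K_{BB} = \Phi_B \Phi_B^\top, \qquad
r_0 = y_B - \Phi_B \theta_0, \\
\theta_{t+1} &= \theta_t + \eta\, \Phi_B^\top \left( y_B - \Phi_B \theta_t \right)
  \;\Longrightarrow\; r_t = (I - \eta K_{BB})^{t}\, r_0, \\
\theta_T - \theta_0 &= \eta\, \Phi_B^\top \sum_{t=0}^{T-1} r_t
  = \Phi_B^\top K_{BB}^{-1} \left[ I - (I - \eta K_{BB})^{T} \right] r_0, \\
\Delta f_A &= \Phi_A \left( \theta_T - \theta_0 \right)
  = K_{AB}\, K_{BB}^{-1} \left[ I - (I - \eta K_{BB})^{T} \right] r_0 .
\end{aligned}
```

Every quantity on the right-hand side of the last line is available before training starts, which is the sense in which the drift is predictable in advance. Because $\Delta f_A = \Phi_A(\theta_T - \theta_0)$, the drift also lies in the column space of $\Phi_A$, which equals the column space of $K_{AA} = \Phi_A \Phi_A^\top$.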

What carries the argument

The cross-task kernel, which produces the closed-form predictor for the forgetting vector and exposes its low-rank concentration in old-task NTK eigenmodes.

If this is right

In the frozen-backbone linear-head case, the forgetting vector can be computed exactly from the cross-task kernel before any new-task gradient step occurs.
Forgetting is confined to a low-dimensional subspace spanned by a few eigenmodes of the old-task NTK.
Parameter-space regularizers can miss the output-space directions where interference actually occurs.
A spectral regularizer that targets only the vulnerable eigenmodes becomes a natural design choice.
Under frozen linear heads, the vulnerable rank follows an explicit Kronecker scaling rule.
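
The Kronecker structure in the last item has a simple, standard counterpart that can be checked directly. For a multi-output linear head $f(x) = W\phi(x)$ on frozen features, the NTK Gram matrix is the feature Gram matrix Kronecker-multiplied by the identity over classes, so its rank is the number of classes times the feature-Gram rank. The sketch below verifies this with an explicit Jacobian. Whether this is the same statement as the paper's scaling rule for the vulnerable rank is an assumption; the sizes `n`, `d`, and `n_classes` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_classes = 6, 4, 3              # samples, feature dim, outputs (illustrative)
phi = rng.normal(size=(n, d))          # stand-in for frozen-backbone features

# Jacobian of the stacked outputs f_c(x_i) with respect to vec(W), W of shape (C, d).
# For sample i, row c holds phi_i in the parameter block belonging to class c.
jac = np.vstack([np.kron(np.eye(n_classes), phi_i[None, :]) for phi_i in phi])

ntk = jac @ jac.T                      # empirical NTK Gram, shape (n*C, n*C)
gram = phi @ phi.T                     # feature Gram, shape (n, n)

kron_form = np.kron(gram, np.eye(n_classes))
rank_ntk = np.linalg.matrix_rank(ntk)
rank_gram = np.linalg.matrix_rank(gram)
```

Here `ntk` matches `kron_form` entry by entry, and `rank_ntk` equals `n_classes * rank_gram`, so each feature-space eigenmode appears once per output class.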

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adapters could be chosen or trained to minimize cross-task kernel entries and thereby reduce forgetting without replay buffers.
Continual-learning diagnostics need only track a small number of principal directions rather than the full output space.
Quantifying the approximation error of the local NTK predictor for nonlinear adapters would turn the theory into a practical bound.
The same kernel-driven view may illuminate negative transfer or interference in multi-task rather than sequential settings.

Load-bearing premise

The analysis assumes the NTK regime and requires the model to be exactly linear in its trainable parameters for the predictor to be exact rather than approximate.

What would settle it

Train a frozen-backbone linear-head model sequentially on two tasks, compute the cross-task kernel once, form the predicted forgetting vector, and verify whether the observed change in old-task outputs matches that vector to numerical precision.
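
A minimal version of this check can be run on synthetic data. The sketch below assumes a scalar-output linear head on random stand-in features, squared loss, and full-batch gradient descent; under those assumptions the predicted drift is `K_AB K_BB^{-1} [I - (I - lr K_BB)^steps] r_0` with `r_0` the initial new-task residual. All names, sizes, and the learning rate are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_a, n_b = 32, 20, 10                          # feature dim, old/new task sizes
phi_a = rng.normal(size=(n_a, d)) / np.sqrt(d)    # stand-in frozen features, task A
phi_b = rng.normal(size=(n_b, d)) / np.sqrt(d)    # stand-in frozen features, task B
y_a = rng.normal(size=n_a)
y_b = rng.normal(size=n_b)

# Head after task A: minimum-norm least-squares fit.
theta0 = np.linalg.lstsq(phi_a, y_a, rcond=None)[0]

k_ab = phi_a @ phi_b.T                            # cross-task kernel
k_bb = phi_b @ phi_b.T                            # new-task kernel (invertible: n_b < d)
lr, steps = 0.1, 200


def predict_forgetting(k_ab, k_bb, resid0, lr, steps):
    """Closed-form old-task drift for GD on 0.5 * ||phi_b @ theta - y_b||^2."""
    eye = np.eye(len(k_bb))
    decay = np.linalg.matrix_power(eye - lr * k_bb, steps)
    return k_ab @ np.linalg.solve(k_bb, (eye - decay) @ resid0)


# Prediction uses only kernels and the initial residual: no training yet.
predicted = predict_forgetting(k_ab, k_bb, y_b - phi_b @ theta0, lr, steps)

# Now actually train on task B and measure the change in old-task outputs.
theta = theta0.copy()
for _ in range(steps):
    theta -= lr * phi_b.T @ (phi_b @ theta - y_b)
realized = phi_a @ theta - phi_a @ theta0

cos_sim = predicted @ realized / (np.linalg.norm(predicted) * np.linalg.norm(realized))
```

In this exactly linear setting `predicted` and `realized` should agree up to floating-point error; a visible gap would point to a bug or a violated assumption rather than to linearization error.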

Figures

Figures reproduced from arXiv: 2606.18024 by Dan Raviv, Ido Nitzan Hidekel.

**Figure 1.** Left: predicted vs. realized $\Delta f_A$ (cos sim > 0.99) on Split-MNIST/CIFAR-10. Center: cumulative forgetting energy, with 50–90% in 1–6 eigenmodes. Right: drift decomposition; the spectral regularizer targets the vulnerable subspace at 75:1 on Split-MNIST (1.7:1 on the CNN-based Split-CIFAR-10, App. L) vs. <1:1 for baselines. … structurally exact in the frozen-backbone linear-head PEFT-CL regime, with 1 − cos sim down to $10^{-6}$ o…

**Figure 2.** Method overview. Left: forgetting $\Delta f_A$ lies in the column space of $K_{AA}$ (Prop. 1), and its energy concentrates on a low-rank slice: the vulnerable subspace $\mathrm{span}(u_1, \dots, u_k)$ spanned by the top eigenvectors of $K_{AA}$. The complementary $u_\perp$ directions are unprotected by construction. Right: the NTK spectrum decays rapidly (grey), and the forgetting-energy coefficients $c_i^2 = (u_i^\top \Delta f_A)^2$ inherit this decay …
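
The energy coefficients in the Figure 2 caption are straightforward to compute once $K_{AA}$ and a drift vector are in hand. The sketch below is an illustrative helper, not the paper's code: the function names and the 90% default threshold are assumptions made here.

```python
import numpy as np


def forgetting_energy_profile(k_aa, delta_f_a):
    """Cumulative fraction of ||delta_f_a||^2 captured by the top-k eigenvectors of k_aa."""
    eigvals, eigvecs = np.linalg.eigh(k_aa)              # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                    # re-sort to descending eigenvalue
    coeffs_sq = (eigvecs[:, order].T @ delta_f_a) ** 2   # c_i^2 = (u_i^T delta_f_a)^2
    return np.cumsum(coeffs_sq) / coeffs_sq.sum()


def modes_needed(cumulative, threshold=0.9):
    """Smallest k whose top-k modes capture at least `threshold` of the energy."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))
```

For example, with `k_aa = np.diag([3.0, 2.0, 1.0])` and `delta_f_a = np.array([2.0, 1.0, 0.0])`, the top mode holds 80% of the energy and two modes are needed to pass the 90% threshold.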

Original abstract

Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives an exact closed-form predictor for forgetting under frozen linear heads via the cross-task NTK, with a low-rank concentration result; the nonlinear case is only a local approximation whose error is not bounded in the abstract.

The core contribution is a function-space derivation showing that new-task training shifts old-task predictions through the cross-task kernel, producing a closed-form forgetting vector before any gradient step. Under frozen backbone and linear head the expression is exact up to numerics, and the same kernel structure shows forgetting concentrates in a few old-task NTK eigenmodes with a Kronecker scaling rule for the vulnerable rank. This is new relative to the parameter-drift and replay views cited, and it directly explains why some regularizers miss output-space interference.

The linear-head case looks solid on its own terms and supplies a concrete target for spectral regularizers. The low-rank claim follows from the eigenmode decomposition without extra assumptions.
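
To make the "concrete target" tangible, one hypothetical form such a regularizer could take is a penalty on old-task output drift projected onto the top-k eigenvectors of the old-task kernel. This is a sketch of the general idea only; the paper's actual regularizer is not described in the abstract, and the function names, the squared-norm form, and the `weight` parameter are assumptions made here.

```python
import numpy as np


def top_eigvecs(k_aa, k):
    """Top-k eigenvectors of the symmetric old-task kernel, as columns."""
    eigvals, eigvecs = np.linalg.eigh(k_aa)          # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]


def spectral_penalty(u_k, f_new, f_old, weight=1.0):
    """Hypothetical penalty: squared drift of old-task outputs inside span(u_k)."""
    proj = u_k.T @ (f_new - f_old)                   # drift coordinates in the protected subspace
    return weight * float(proj @ proj)
```

Drift along a protected eigenvector is penalized in full, while drift orthogonal to `u_k` costs nothing, which is the contrast with a parameter-space penalty that the note describes.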

The soft spot is the extension to nonlinear adapters or full fine-tuning. The abstract labels this a local NTK approximation but supplies no remainder term, error bound, or empirical quantification of the linearization error. If the full paper contains such checks or experiments showing the predictor remains useful, that would tighten the result; otherwise the main claim stays restricted to the linear-head PEFT setting.

This is worth a serious referee for the continual-learning theory crowd. The math is direct, the framing is honest about its regime, and the function-space angle organizes prior NTK-overlap work without circularity. I would bring it to a reading group to walk through the derivation and see how the approximation behaves in the experiments.

Referee Report

2 major / 2 minor

Summary. The paper develops a function-space theory of catastrophic forgetting in the NTK regime. New-task training induces old-task prediction drift via the cross-task kernel, yielding a closed-form predictor for the forgetting vector. This predictor is exact (up to numerics) for frozen-backbone linear-head PEFT-CL and a local approximation otherwise. The same expression shows forgetting concentrates in a small number of old-task NTK eigenmodes and yields a Kronecker scaling rule for vulnerable rank under linear heads. The work relates this to prior NTK-overlap theory and motivates spectral regularization.

Significance. If the derivations hold, the closed-form predictor and low-rank characterization provide a precise output-space account of forgetting that explains limitations of parameter-space regularizers and enables targeted interventions. The exactness result in the linear-head case, together with the eigenmode concentration, is a clear strength offering falsifiable predictions and reproducible analysis in the PEFT-CL setting.

major comments (2)

[§3.2, Eq. (9)] The local NTK approximation for nonlinear adapters and full fine-tuning is stated without an error bound, remainder term, or empirical quantification of linearization error. This is load-bearing for extending the exact linear-head result to the broader PEFT-CL claims in the title and abstract.
[§4.1, Theorem 2] The claim that forgetting concentrates in a small number of eigenmodes relies on the cross-task kernel structure, but the paper does not report the numerical rank or eigenvalue decay rates across the evaluated tasks to confirm the 'small number' is consistent and not task-dependent.

minor comments (2)

[§2] Notation for the cross-task kernel K_{12} is introduced without an explicit comparison table to prior NTK-overlap definitions, which would clarify the claimed relation to earlier work.
[Figure 3] Figure 3 caption does not state the number of random seeds or the precise metric used for the 'forgetting vector' norm, reducing reproducibility of the eigenmode plots.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We address each major comment below and indicate the revisions we will incorporate.

Referee: [§3.2, Eq. (9)] the local NTK approximation for nonlinear adapters and full fine-tuning is stated without an error bound, remainder term, or empirical quantification of linearization error. This is load-bearing for extending the exact linear-head result to the broader PEFT-CL claims in the title and abstract.

Authors: We agree that the local NTK approximation lacks a formal error bound or remainder term; deriving a non-vacuous bound for finite-width nonlinear adapters remains an open technical challenge beyond the scope of the present work. The exact closed-form result is restricted to the frozen linear-head case, as stated in the abstract and §3. The approximation for nonlinear cases is presented as a local predictor whose validity is supported by the standard NTK linearization. To strengthen the manuscript we will add empirical quantification of the linearization error (comparing the closed-form predictor against observed prediction drift on nonlinear adapters) in the experiments section and will explicitly qualify the title/abstract claims to emphasize the exact linear-head setting while noting the approximation for other regimes.
Revision: yes

Referee: [§4.1, Theorem 2] the claim that forgetting concentrates in a small number of eigenmodes relies on the cross-task kernel structure, but the paper does not report the numerical rank or eigenvalue decay rates across the evaluated tasks to confirm the 'small number' is consistent and not task-dependent.

Authors: We concur that explicit reporting of numerical ranks and eigenvalue decay is needed to substantiate the consistency of the low-rank concentration. The current manuscript illustrates concentration via the closed-form expression and selected visualizations, but does not tabulate effective ranks. We will revise §4.1 and the associated experiments to include tables (or supplementary plots) of eigenvalue spectra and numerical ranks (e.g., count of eigenvalues exceeding 1% of the largest) for every task pair, thereby confirming that the effective rank remains small and stable across the evaluated settings.
Revision: yes
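
The numerical-rank criterion named in the response (count of eigenvalues exceeding 1% of the largest) is a one-liner in practice. The helper below is an illustrative sketch; the function name and the `rel_tol` parameter are assumptions made here, and it expects a symmetric kernel matrix.

```python
import numpy as np


def numerical_rank(kernel, rel_tol=0.01):
    """Number of eigenvalues larger than rel_tol times the largest eigenvalue."""
    eigvals = np.linalg.eigvalsh(kernel)      # symmetric input, real eigenvalues
    return int(np.sum(eigvals > rel_tol * eigvals.max()))
```

Applied to `np.diag([10.0, 5.0, 0.05, 0.001])`, the threshold is 0.1 and the function returns 2. Reporting this number for every task pair would show directly whether the "small number" of modes is stable.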

Circularity Check

0 steps flagged

No circularity: closed-form predictor derived from standard NTK cross-task kernel without reduction to fitted inputs or self-citation chains

The paper's central claim is a closed-form expression for old-task prediction drift induced by new-task training via the cross-task kernel in the NTK regime. This is presented as following directly from the kernel definition under the stated linearity assumption (frozen backbone + linear head), with the nonlinear case explicitly labeled a local approximation. No equations or text in the provided abstract reduce the predictor to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation. The derivation is framed as a direct consequence of the NTK framework applied to the continual adaptation setting, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the NTK regime for the models considered and on the linearity assumption that makes the predictor exact in the PEFT-CL setting.

axioms (1)

Domain assumption: the NTK regime approximation holds for the model behavior during new-task training.
The entire analysis is conducted in the NTK regime, as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5695 in / 1272 out tokens · 38794 ms · 2026-06-27T01:34:27.029525+00:00 · methodology

