
arxiv: 2606.18024 · v1 · pith:PC5AQJLRnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation

Pith reviewed 2026-06-27 01:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords catastrophic forgetting · continual learning · NTK regime · function space · low-rank structure · cross-task kernel · PEFT · spectral regularization

The pith

New-task training induces old-task prediction drift through the cross-task kernel, yielding an exact closed-form forgetting vector in the linear-head NTK case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies a function-space account of catastrophic forgetting inside the NTK regime. Training on a new task produces a predictable drift in old-task outputs that is fully determined by the cross-task kernel, and this drift can be written down before any gradient step is taken. When the backbone is frozen and the head is linear, the expression is exact; otherwise it remains a local approximation. The same formula shows that the drift lives in only a few eigenmodes of the old-task NTK and obeys a Kronecker scaling rule for its rank under linear heads.

Core claim

In the NTK regime, new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL the predictor is exact up to numerical precision; for nonlinear adapters it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank.
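
This page does not reproduce the paper's equation, so the following is only a sketch of the standard computation in the exactly linear case. It rests on assumptions stated here rather than taken from the paper: scalar outputs, frozen features φ, squared loss on the new task B, and plain gradient descent with step size η. The symbols Φ_A, Φ_B (stacked feature matrices), r_t (new-task residual) and η are illustrative notation.

```latex
\begin{aligned}
f(x;\theta) &= \phi(x)^\top \theta, \qquad
K_{AB} = \Phi_A \Phi_B^\top, \qquad K_{BB} = \Phi_B \Phi_B^\top, \\
\theta_{t+1} &= \theta_t - \eta\, \Phi_B^\top r_t, \qquad
r_t = \Phi_B \theta_t - y_B
\;\Longrightarrow\; r_{t+1} = (I - \eta K_{BB})\, r_t, \\
\Delta f_A(t) &= \Phi_A (\theta_t - \theta_0)
= -\eta\, K_{AB} \sum_{s=0}^{t-1} (I - \eta K_{BB})^{s}\, r_0 \\
&= -K_{AB} K_{BB}^{-1} \bigl[\, I - (I - \eta K_{BB})^{t} \,\bigr]\, r_0
\qquad \text{(if } K_{BB} \text{ is invertible).}
\end{aligned}
```

Everything on the right-hand side (the two kernels and the initial residual r_0) is available before the first new-task step, which is the sense in which the drift can be predicted in advance. For η < 2/λ_max(K_BB) the bracket tends to the identity as t grows, leaving Δf_A → K_AB K_BB^{-1} (y_B − f_0(X_B)).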

What carries the argument

The cross-task kernel, which produces the closed-form predictor for the forgetting vector and exposes its low-rank concentration in old-task NTK eigenmodes.
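
To make the low-rank statement concrete, here is a minimal NumPy sketch of how the forgetting energy can be split across old-task eigenmodes, using the coefficients c_i^2 = (u_i^T Δf_A)^2 that appear in the Figure 2 caption. The function name, the random stand-in features and the decaying feature scales are all hypothetical; nothing here is taken from the paper's code.

```python
import numpy as np

def forgetting_energy_profile(k_aa, delta_f_a):
    """Cumulative share of ||delta_f_a||^2 carried by the top eigenvectors of k_aa."""
    eigvals, eigvecs = np.linalg.eigh(k_aa)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]              # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coeffs_sq = (eigvecs.T @ delta_f_a) ** 2       # c_i^2 = (u_i^T delta_f_a)^2
    return eigvals, np.cumsum(coeffs_sq) / coeffs_sq.sum()

rng = np.random.default_rng(0)
n_a, n_b, d = 40, 10, 16
scales = 0.7 ** np.arange(d)                       # assumed decaying feature scales
phi_a = rng.normal(size=(n_a, d)) * scales         # stand-in old-task features
phi_b = rng.normal(size=(n_b, d)) * scales         # stand-in new-task features

# Gradient updates on task B stay in the row space of phi_b.
delta_theta = phi_b.T @ rng.normal(size=n_b)
delta_f_a = phi_a @ delta_theta                    # induced old-task drift

eigvals, cumulative = forgetting_energy_profile(phi_a @ phi_a.T, delta_f_a)
print(np.round(cumulative[:6], 4))
```

Because the drift equals Φ_A times a parameter update, it lies in the column space of K_AA, so in this toy setup all of its energy sits in at most d of the n_a eigenmodes. How quickly the cumulative curve saturates within those modes depends on the spectrum and is an empirical question.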

If this is right

  • The exact forgetting vector can be computed from the cross-task kernel before any new-task gradient step occurs.
  • Forgetting is confined to a low-dimensional subspace spanned by a few eigenmodes of the old-task NTK.
  • Parameter-space regularizers can miss the output-space directions where interference actually occurs.
  • A spectral regularizer that targets only the vulnerable eigenmodes becomes a natural design choice.
  • Under frozen linear heads the rank of the forgetting matrix follows an explicit Kronecker product rule.
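
The summary does not spell out the Kronecker rule, so the sketch below shows only the standard fact it most plausibly builds on: for a linear head f(x) = W φ(x) with C outputs on frozen features, the multi-output NTK factorizes as (Φ Φ^T) ⊗ I_C, and its rank is therefore C times the rank of the feature kernel. The sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 12, 5, 3                      # samples, frozen feature dim, outputs (assumed)
phi = rng.normal(size=(n, d))           # stand-in frozen-backbone features

# Jacobian of f(x_i) = W phi_i with respect to row-major vec(W) is I_c kron phi_i^T.
jac = np.concatenate(
    [np.kron(np.eye(c), phi[i][None, :]) for i in range(n)], axis=0
)                                       # shape (n*c, c*d)
ntk_full = jac @ jac.T                  # explicit multi-output NTK
ntk_kron = np.kron(phi @ phi.T, np.eye(c))

rank_feat = np.linalg.matrix_rank(phi @ phi.T)
rank_full = np.linalg.matrix_rank(ntk_full)
print(rank_feat, rank_full)
```

Whether the paper's "vulnerable rank" scales in exactly this way is not confirmed by this page; the check only illustrates why a product rule is natural under frozen linear heads.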

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adapters could be chosen or trained to minimize cross-task kernel entries and thereby reduce forgetting without replay buffers.
  • Continual-learning diagnostics need only track a small number of principal directions rather than the full output space.
  • Quantifying the approximation error of the local NTK predictor for nonlinear adapters would turn the theory into a practical bound.
  • The same kernel-driven view may illuminate negative transfer or interference in multi-task rather than sequential settings.
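
The third point above can be probed numerically. The sketch below is an assumption-laden toy, not the paper's experiment: it uses a small tanh network f(x) = v^T tanh(W x) as a stand-in for a nonlinear adapter, takes a single gradient step on a squared loss, and compares the first-order NTK prediction of the old-task drift with the drift actually observed. All names and sizes are hypothetical.

```python
import numpy as np

def forward(w, v, x):
    return np.tanh(x @ w.T) @ v                      # x: (n, p), w: (h, p) -> (n,)

def jacobian(w, v, x):
    act = np.tanh(x @ w.T)                           # (n, h)
    d_w = ((1 - act ** 2) * v)[:, :, None] * x[:, None, :]   # df/dW, shape (n, h, p)
    return np.concatenate([d_w.reshape(len(x), -1), act], axis=1)  # [df/dW, df/dv]

def drift_relative_error(lr, w, v, x_a, x_b, y_b):
    """Relative gap between the linearized and the observed one-step old-task drift."""
    j_a, j_b = jacobian(w, v, x_a), jacobian(w, v, x_b)
    resid = forward(w, v, x_b) - y_b
    predicted = -lr * (j_a @ j_b.T) @ resid          # -lr * K_AB * residual
    grad = j_b.T @ resid                             # gradient of 0.5 * ||f_B - y_B||^2
    h, p = w.shape
    w_new = w - lr * grad[: h * p].reshape(h, p)
    v_new = v - lr * grad[h * p:]
    observed = forward(w_new, v_new, x_a) - forward(w, v, x_a)
    return np.linalg.norm(predicted - observed) / np.linalg.norm(observed)

rng = np.random.default_rng(0)
p, h = 4, 8
w = rng.normal(size=(h, p)) / np.sqrt(p)
v = rng.normal(size=h) / np.sqrt(h)
x_a, x_b = rng.normal(size=(10, p)), rng.normal(size=(6, p))
y_b = rng.normal(size=6)

err_large = drift_relative_error(1e-2, w, v, x_a, x_b, y_b)
err_small = drift_relative_error(1e-4, w, v, x_a, x_b, y_b)
print(err_large, err_small)
```

The gap shrinks with the step size because the neglected terms are second order in the parameter change. A practical bound would have to control how this gap accumulates over many steps, which the toy does not attempt.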

Load-bearing premise

The analysis assumes the NTK regime and requires the model to be exactly linear in its trainable parameters for the predictor to be exact rather than approximate.

What would settle it

Train a frozen-backbone linear-head model sequentially on two tasks, compute the cross-task kernel once, form the predicted forgetting vector, and verify whether the observed change in old-task outputs matches that vector to numerical precision.
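
A minimal sketch of such a check, under assumptions not taken from the paper: random matrices stand in for frozen-backbone features, the head is a single linear layer with scalar output, the new task uses a squared loss, and training is plain full-batch gradient descent. For that setup the drift after t steps is −K_AB K_BB^{-1} [I − (I − η K_BB)^t] r_0, where r_0 is the initial new-task residual. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_a, n_b = 64, 20, 15                         # feature dim, old/new sample counts
phi_a = rng.normal(size=(n_a, d)) / np.sqrt(d)   # stand-in old-task features
phi_b = rng.normal(size=(n_b, d)) / np.sqrt(d)   # stand-in new-task features
y_b = rng.normal(size=n_b)                       # new-task regression targets
theta0 = rng.normal(size=d)                      # head weights after task A

# Kernels are computed once, before any new-task gradient step.
k_ab = phi_a @ phi_b.T                           # cross-task kernel
k_bb = phi_b @ phi_b.T                           # new-task kernel

lr, steps = 0.1, 200
r0 = phi_b @ theta0 - y_b                        # initial new-task residual

# Predicted drift from the closed form quoted in the lead-in.
decay = np.linalg.matrix_power(np.eye(n_b) - lr * k_bb, steps)
predicted = -k_ab @ np.linalg.solve(k_bb, (np.eye(n_b) - decay) @ r0)

# Observed drift: actually run gradient descent on 0.5 * ||phi_b theta - y_b||^2.
theta = theta0.copy()
for _ in range(steps):
    theta -= lr * phi_b.T @ (phi_b @ theta - y_b)
observed = phi_a @ (theta - theta0)

cos = predicted @ observed / (np.linalg.norm(predicted) * np.linalg.norm(observed))
rel_err = np.linalg.norm(predicted - observed) / np.linalg.norm(observed)
print(1 - cos, rel_err)
```

In this linear toy the two vectors agree up to floating-point error. The informative version of the test is the same comparison on real frozen features and, separately, on nonlinear adapters where agreement is only expected locally.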

Figures

Figures reproduced from arXiv: 2606.18024 by Dan Raviv, Ido Nitzan Hidekel.

Figure 1: Left: Predicted vs. realized Δf_A (cos sim > 0.99) on Split-MNIST/CIFAR-10. Center: Cumulative forgetting energy: 50–90% in 1–6 eigenmodes. Right: Drift decomposition: spectral reg targets the vulnerable subspace at 75:1 on Split-MNIST (1.7:1 on the CNN-based Split-CIFAR-10, App. L) vs. <1:1 for baselines. structurally exact in the frozen-backbone linear-head PEFT-CL regime, with 1 − cos sim down to 10^−6 o…
Figure 2: Method overview. Left: Forgetting Δf_A lies in the column space of K_AA (Prop. 1), and its energy concentrates on a low-rank slice: the vulnerable subspace span(u_1, …, u_k) spanned by the top eigenvectors of K_AA. The complementary u_⊥ directions are unprotected by construction. Right: the NTK spectrum decays rapidly (grey), and the forgetting-energy coefficients c_i^2 = (u_i^⊤ Δf_A)^2 inherit this decay …
Original abstract

Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.
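
The abstract does not define the spectral regularizer, so the following is one hypothetical form rather than the authors' method: in the linear-head setting, penalize the component of the old-task drift that falls in the span of the top-k eigenvectors of K_AA. The sketch solves the penalized least-squares problem directly with `np.linalg.lstsq`, which returns the minimum-norm minimizer (the point gradient descent from a zero update would converge to). The penalty weight `lam`, the choice k = 4 and the random features are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, d, k = 30, 10, 64, 4
phi_a = rng.normal(size=(n_a, d)) / np.sqrt(d)   # stand-in old-task features
phi_b = rng.normal(size=(n_b, d)) / np.sqrt(d)   # stand-in new-task features
resid_b = rng.normal(size=n_b)                   # y_B - f_0(X_B): change task B asks for

# Top-k eigenvectors of the old-task kernel define the protected subspace.
eigvals, eigvecs = np.linalg.eigh(phi_a @ phi_a.T)
u_k = eigvecs[:, -k:]                            # eigh sorts ascending: last k are largest

def solve_update(lam):
    """Min-norm minimizer of ||phi_b u - resid_b||^2 + lam * ||u_k^T phi_a u||^2."""
    design = np.vstack([phi_b, np.sqrt(lam) * (u_k.T @ phi_a)])
    target = np.concatenate([resid_b, np.zeros(k)])
    return np.linalg.lstsq(design, target, rcond=None)[0]

def protected_energy(update):
    """Squared norm of the old-task drift inside the protected subspace."""
    return float(np.sum((u_k.T @ (phi_a @ update)) ** 2))

plain, regularized = solve_update(0.0), solve_update(10.0)
print(protected_energy(plain), protected_energy(regularized))
```

In this overparameterized toy (d larger than n_b + k) the penalty can be driven to essentially zero while still fitting the new task exactly. In tighter regimes the two terms trade off and the weight matters.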

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a function-space theory of catastrophic forgetting in the NTK regime. New-task training induces old-task prediction drift via the cross-task kernel, yielding a closed-form predictor for the forgetting vector. This predictor is exact (up to numerics) for frozen-backbone linear-head PEFT-CL and a local approximation otherwise. The same expression shows forgetting concentrates in a small number of old-task NTK eigenmodes and yields a Kronecker scaling rule for vulnerable rank under linear heads. The work relates this to prior NTK-overlap theory and motivates spectral regularization.

Significance. If the derivations hold, the closed-form predictor and low-rank characterization provide a precise output-space account of forgetting that explains limitations of parameter-space regularizers and enables targeted interventions. The exactness result in the linear-head case, together with the eigenmode concentration, is a clear strength offering falsifiable predictions and reproducible analysis in the PEFT-CL setting.

major comments (2)
  1. [§3.2, Eq. (9)] The local NTK approximation for nonlinear adapters and full fine-tuning is stated without an error bound, remainder term, or empirical quantification of linearization error. This is load-bearing for extending the exact linear-head result to the broader PEFT-CL claims in the title and abstract.
  2. [§4.1, Theorem 2] The claim that forgetting concentrates in a small number of eigenmodes relies on the cross-task kernel structure, but the paper does not report the numerical rank or eigenvalue decay rates across the evaluated tasks to confirm the 'small number' is consistent and not task-dependent.
minor comments (2)
  1. [§2] Notation for the cross-task kernel K_{12} is introduced without an explicit comparison table to prior NTK-overlap definitions, which would clarify the claimed relation to earlier work.
  2. [Figure 3] The caption does not state the number of random seeds or the precise metric used for the 'forgetting vector' norm, reducing reproducibility of the eigenmode plots.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our results. We address each major comment below and indicate the revisions we will incorporate.

  1. Referee: [§3.2, Eq. (9)] the local NTK approximation for nonlinear adapters and full fine-tuning is stated without an error bound, remainder term, or empirical quantification of linearization error. This is load-bearing for extending the exact linear-head result to the broader PEFT-CL claims in the title and abstract.

    Authors: We agree that the local NTK approximation lacks a formal error bound or remainder term; deriving a non-vacuous bound for finite-width nonlinear adapters remains an open technical challenge beyond the scope of the present work. The exact closed-form result is restricted to the frozen linear-head case, as stated in the abstract and §3. The approximation for nonlinear cases is presented as a local predictor whose validity is supported by the standard NTK linearization. To strengthen the manuscript we will add empirical quantification of the linearization error (comparing the closed-form predictor against observed prediction drift on nonlinear adapters) in the experiments section and will explicitly qualify the title/abstract claims to emphasize the exact linear-head setting while noting the approximation for other regimes. revision: yes

  2. Referee: [§4.1, Theorem 2] the claim that forgetting concentrates in a small number of eigenmodes relies on the cross-task kernel structure, but the paper does not report the numerical rank or eigenvalue decay rates across the evaluated tasks to confirm the 'small number' is consistent and not task-dependent.

    Authors: We concur that explicit reporting of numerical ranks and eigenvalue decay is needed to substantiate the consistency of the low-rank concentration. The current manuscript illustrates concentration via the closed-form expression and selected visualizations, but does not tabulate effective ranks. We will revise §4.1 and the associated experiments to include tables (or supplementary plots) of eigenvalue spectra and numerical ranks (e.g., count of eigenvalues exceeding 1% of the largest) for every task pair, thereby confirming that the effective rank remains small and stable across the evaluated settings. revision: yes
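
The rank criterion mentioned in the second response (count of eigenvalues exceeding 1% of the largest) is simple to state in code. The helper below is a hypothetical sketch for a symmetric kernel matrix; the function name and the example spectrum are made up for illustration.

```python
import numpy as np

def numerical_rank(kernel, rel_threshold=0.01):
    """Number of eigenvalues above rel_threshold times the largest eigenvalue."""
    eigvals = np.linalg.eigvalsh(kernel)          # symmetric input assumed
    return int(np.sum(eigvals > rel_threshold * eigvals.max()))

# Illustrative spectrum: the cutoff is 1% of 10, i.e. 0.1, so three values pass.
example = np.diag([10.0, 5.0, 0.2, 0.05, 0.001])
rank_example = numerical_rank(example)
print(rank_example)
```

Reporting this count per task pair, alongside the raw spectra, would show whether the "small number" of modes is stable or depends on the threshold chosen.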

Circularity Check

0 steps flagged

No circularity: closed-form predictor derived from standard NTK cross-task kernel without reduction to fitted inputs or self-citation chains

full rationale

The paper's central claim is a closed-form expression for old-task prediction drift induced by new-task training via the cross-task kernel in the NTK regime. This is presented as following directly from the kernel definition under the stated linearity assumption (frozen backbone + linear head), with the nonlinear case explicitly labeled a local approximation. No equations or text in the provided abstract reduce the predictor to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation. The derivation is framed as a direct consequence of the NTK framework applied to the continual adaptation setting, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the NTK regime for the models considered and on the linearity assumption that makes the predictor exact in the PEFT-CL setting.

axioms (1)
  • domain assumption NTK regime approximation holds for the model behavior during new-task training
    Entire analysis is conducted in the NTK regime as stated in the abstract.


