
arxiv: 2601.22012 · v2 · submitted 2026-01-29 · 💻 cs.LG

Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability

Pith reviewed 2026-05-16 09:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords catastrophic forgetting · continual learning · mechanistic interpretability · feature encoding · geometric interpretation · crosscoders · vision transformers

The pith

Catastrophic forgetting results from geometric transformations in how neural networks encode individual features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a mechanistic framework that interprets catastrophic forgetting as transformations to the geometric encoding of specific features inside neural networks. These changes cause forgetting either by shrinking the capacity allocated to a feature or by breaking how later layers read it out. A tractable toy model formalizes the view, identifies best- and worst-case scenarios, and shows that depth worsens the transformations. The same lens is then applied to a Vision Transformer trained sequentially on CIFAR-10 by using Crosscoders to surface the responsible encoding shifts. A reader would care because the approach replaces black-box performance measurements with a precise, feature-level vocabulary that could guide more targeted continual-learning methods.

Core claim

We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features or by disrupting their readout by downstream computations. Analysis of a tractable toy model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders in a Vision-Transf

What carries the argument

Mechanistic framework that supplies a geometric interpretation of forgetting via transformations to individual feature encodings, made concrete by a toy model and by Crosscoders applied to Vision Transformers.

If this is right

  • Depth increases the severity of encoding transformations that produce forgetting.
  • Forgetting occurs when feature capacity shrinks or when downstream readout is disrupted.
  • Crosscoders can locate the exact encoding transformations responsible for forgetting in a trained Vision Transformer.
  • Best-case continual-learning trajectories preserve both feature capacity and readout fidelity across tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization that explicitly protects feature encodings could become a new family of continual-learning algorithms.
  • The same geometric analysis could be applied to sequential fine-tuning of language models to diagnose their forgetting patterns.
  • Architectures engineered for more stable feature geometries might reduce reliance on continual-learning interventions altogether.

Load-bearing premise

The tractable toy model and the Crosscoders accurately capture the feature-encoding transformations that produce forgetting in practical deep networks.

What would settle it

Demonstrating that forgetting occurs in a Vision Transformer on sequential CIFAR-10 without any measurable transformation in the feature encodings identified by Crosscoders would falsify the central claim.

Original abstract

Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features or by disrupting their readout by downstream computations. Analysis of a tractable toy model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We do so through a case study example of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that catastrophic forgetting arises from geometric transformations in the encoding of individual features, which reduce allocated capacity or disrupt downstream readout. It formalizes this view in a tractable toy model (identifying best- and worst-case scenarios and the detrimental role of depth), empirically validates the analysis on the toy model, and demonstrates applicability to practical networks via Crosscoders in a Vision Transformer trained sequentially on CIFAR-10, providing a feature-centric vocabulary for continual learning.

Significance. If the central geometric mechanisms hold and transfer, the work supplies a new mechanistic lens and vocabulary that could unify performance-level observations in continual learning with interpretable feature dynamics. The toy-model formalization and depth analysis are concrete strengths; the Crosscoders case study suggests a path toward analyzing real models, though stronger causal evidence would be needed to elevate impact.

major comments (3)
  1. [§4] §4 (Toy Model Experiments): the claim that depth is detrimental is supported only within the specific toy-model geometry; no ablation or scaling argument shows why this transfers to the ViT architecture used later, leaving the 'worst-case scenario' identification load-bearing but unlinked to the practical case study.
  2. [§5] §5 (Crosscoders Case Study on ViT): the transformations identified by Crosscoders are shown to correlate with forgetting, but the manuscript contains no intervention (e.g., targeted editing of the discovered feature directions or capacity metrics) that would establish causality rather than correlation; this directly weakens the central claim that the toy-model mechanisms operate in practical networks.
  3. [§3] §3 (Geometric Formalization): the definitions of 'capacity reduction' and 'readout disruption' are introduced geometrically but lack explicit, parameter-free metrics that can be computed identically on both the toy model and the ViT activations; without such a transferable quantity, the framework risks being descriptive rather than predictive.
minor comments (2)
  1. [Figure 3] Figure 3 (toy-model depth plot): axis labels and legend entries are too small for readability; enlarging and adding a brief caption explaining the geometric quantities plotted would improve clarity.
  2. [§5] The term 'Crosscoders' is used without an inline equation or pseudocode definition in the main text; a short formal description (even if details are in the appendix) would help readers follow the ViT analysis.
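
For orientation on the term flagged in the second minor comment, the block below sketches the general sparse-crosscoder formulation as it is commonly stated in the interpretability literature. It is an assumption that the paper uses this form: the index set $S$ (read here as the model checkpoints before and after a new task), the weight names, and the sparsity coefficient $\lambda$ are illustrative and are not taken from the manuscript.

```latex
% Shared sparse code computed from the activations a^{s}(x) of every source s in S
f(x) = \mathrm{ReLU}\Big( \sum_{s \in S} W_{\mathrm{enc}}^{s}\, a^{s}(x) + b_{\mathrm{enc}} \Big)

% Per-source reconstruction from the shared code
\hat{a}^{s}(x) = W_{\mathrm{dec}}^{s}\, f(x) + b_{\mathrm{dec}}^{s}

% Reconstruction error plus a decoder-norm-weighted sparsity penalty
\mathcal{L}(x) = \sum_{s \in S} \big\| a^{s}(x) - \hat{a}^{s}(x) \big\|_2^2
  + \lambda \sum_{i} f_i(x) \sum_{s \in S} \big\| W_{\mathrm{dec},\,i}^{s} \big\|_2
```

Under this reading, comparing the decoder columns $W_{\mathrm{dec},\,i}^{s}$ of a single latent $i$ across checkpoints is what would expose a change in that feature's encoding.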

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, clarifying the intended scope of the toy model and case study while committing to revisions that improve precision and transferability of the framework.

Point-by-point responses
  1. Referee: [§4] §4 (Toy Model Experiments): the claim that depth is detrimental is supported only within the specific toy-model geometry; no ablation or scaling argument shows why this transfers to the ViT architecture used later, leaving the 'worst-case scenario' identification load-bearing but unlinked to the practical case study.

    Authors: The toy model is presented as a controlled, analytically tractable setting in which geometric mechanisms can be formalized and worst-case scenarios derived exactly; the ViT experiment is positioned as an application of the resulting vocabulary via Crosscoders rather than a direct empirical transfer test. We agree that the manuscript would benefit from an explicit discussion of the modeling assumptions required for generalization. In the revision we will add a dedicated paragraph in §4 and the discussion section that states the toy-model depth result as illustrative of possible mechanisms, notes the absence of scaling ablations, and outlines conditions under which similar depth effects might appear in transformers. This clarification prevents over-interpretation while preserving the toy model’s role as a formalization tool.
    revision: partial

  2. Referee: [§5] §5 (Crosscoders Case Study on ViT): the transformations identified by Crosscoders are shown to correlate with forgetting, but the manuscript contains no intervention (e.g., targeted editing of the discovered feature directions or capacity metrics) that would establish causality rather than correlation; this directly weakens the central claim that the toy-model mechanisms operate in practical networks.

    Authors: The Crosscoders analysis demonstrates that the geometric transformations predicted by the toy model co-occur with measured forgetting; no causal interventions (feature editing, capacity manipulation) are performed. We will revise §5 and the abstract to describe the result explicitly as correlational evidence that the framework can be applied to real networks, and we will add a short subsection outlining how targeted interventions could be conducted in follow-up work. This adjustment aligns the stated claims with the evidence actually presented.
    revision: yes

  3. Referee: [§3] §3 (Geometric Formalization): the definitions of 'capacity reduction' and 'readout disruption' are introduced geometrically but lack explicit, parameter-free metrics that can be computed identically on both the toy model and the ViT activations; without such a transferable quantity, the framework risks being descriptive rather than predictive.

    Authors: We accept that the geometric definitions in §3 would be strengthened by accompanying, parameter-free metrics usable on both the toy model and Crosscoder-extracted features. In the revision we will define (i) capacity reduction as the fractional decrease in the volume of the convex hull of a feature’s activation vectors across tasks and (ii) readout disruption as the increase in the angle between the feature direction and the linear readout weights. Both quantities are computable from the same activation matrices in the toy model and from the Crosscoder dictionary vectors in the ViT, thereby providing a uniform, predictive bridge between the two settings.
    revision: yes
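
The two metrics proposed in this response can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes 2-D activation vectors, so the convex-hull "volume" reduces to an area that the standard library can compute, and the function names (`angle_deg`, `readout_disruption`, `hull_area_2d`, `capacity_reduction`) and the example data are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.degrees(math.acos(cos))


def readout_disruption(feat_before, feat_after, readout):
    """Increase in the feature-to-readout angle after the new task."""
    return angle_deg(feat_after, readout) - angle_deg(feat_before, readout)


def hull_area_2d(points):
    """Area of the 2-D convex hull (monotone chain, then shoelace formula)."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point is the first point of the other half

    hull = half_hull(pts) + half_hull(reversed(pts))
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1])
    )
    return abs(twice_area) / 2.0


def capacity_reduction(acts_before, acts_after):
    """Fractional decrease in hull area of a feature's activation vectors."""
    before = hull_area_2d(acts_before)
    if before == 0.0:
        raise ValueError("hull of the earlier activations has zero area")
    return (before - hull_area_2d(acts_after)) / before


# Hypothetical data: activations shrink by half; the feature rotates 90 degrees.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shrunk = [(0.5 * x, 0.5 * y) for x, y in square]
print(capacity_reduction(square, shrunk))          # 0.75
print(readout_disruption((1, 0), (0, 1), (1, 0)))  # 90.0
```

In higher dimensions a general hull routine would be needed, and the hull volume is zero whenever the points span fewer dimensions than the ambient space, which is a practical caveat for the proposed capacity metric.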

Circularity Check

0 steps flagged

No circularity: a new geometric framework is formalized in a toy model, then applied via Crosscoders.

Full rationale

The derivation introduces an original mechanistic framework with geometric interpretation of feature encoding transformations, formalizes it via analysis and experiments on a tractable toy model (identifying capacity reduction and readout disruption), and demonstrates application to a ViT via Crosscoders on sequential CIFAR-10. No load-bearing step reduces by construction to fitted parameters, self-citations, or renamed known results; the central claims rest on independent toy-model formalization and case-study analysis rather than tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the toy model formalizes real mechanisms and that Crosscoders provide faithful feature analysis, both domain assumptions without independent evidence supplied in the abstract.

axioms (1)
  • [domain assumption] The tractable toy model captures the mechanisms of catastrophic forgetting in deeper practical networks.
    Invoked to formalize the geometric view and identify best/worst-case scenarios.
invented entities (1)
  • Crosscoders for feature transformation analysis [no independent evidence]
    Purpose: to inspect encoding changes in practical models such as Vision Transformers.
    Newly applied tool in the case study to demonstrate the framework.

pith-pipeline@v0.9.0 · 5457 in / 1408 out tokens · 43059 ms · 2026-05-16T09:55:56.317717+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity a...