Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability
Pith reviewed 2026-05-16 09:55 UTC · model grok-4.3
The pith
Catastrophic forgetting results from geometric transformations in how neural networks encode individual features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features or by disrupting their readout by downstream computations. Analysis of a tractable toy model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders in a Vision-Transf
What carries the argument
Mechanistic framework that supplies a geometric interpretation of forgetting via transformations to individual feature encodings, made concrete by a toy model and by Crosscoders applied to Vision Transformers.
If this is right
- Depth increases the severity of encoding transformations that produce forgetting.
- Forgetting occurs when feature capacity shrinks or when downstream readout is disrupted.
- Crosscoders can locate the exact encoding transformations responsible for forgetting in a trained Vision Transformer.
- Best-case continual-learning trajectories preserve both feature capacity and readout fidelity across tasks.
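The two failure modes listed above can be illustrated with a minimal sketch. It is not the paper's toy model: it assumes a single feature encoded as a 2-D vector, a fixed linear readout, and uses the encoding's norm as a crude stand-in for allocated capacity. The names `transform`, `w_old`, `readout`, and `scenarios` are hypothetical.

```python
import math

def transform(w, scale, angle_deg):
    """Rotate a 2-D encoding vector by angle_deg degrees and rescale it by scale."""
    t = math.radians(angle_deg)
    x, y = w
    return (scale * (x * math.cos(t) - y * math.sin(t)),
            scale * (x * math.sin(t) + y * math.cos(t)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# Assumption: after task 1 the feature direction and the readout are aligned.
w_old = (1.0, 0.0)
readout = (1.0, 0.0)

# Hypothetical encodings of the same feature after training on task 2.
scenarios = {
    "unchanged": transform(w_old, 1.0, 0.0),
    "shrunk": transform(w_old, 0.1, 0.0),    # norm drops (capacity proxy shrinks)
    "rotated": transform(w_old, 1.0, 90.0),  # norm kept, readout no longer aligned
}

for name, w_new in scenarios.items():
    # Columns: scenario, encoding norm, response seen by the frozen readout.
    print(name, round(norm(w_new), 3), round(dot(readout, w_new), 3))
```

In the rotated case the encoding keeps its full norm, yet the frozen readout sees almost nothing; in the shrunk case the readout stays aligned but the signal itself is attenuated. Both changes degrade the old task's output even though the feature is still represented in some form.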
Where Pith is reading between the lines
- Regularization that explicitly protects feature encodings could become a new family of continual-learning algorithms.
- The same geometric analysis could be applied to sequential fine-tuning of language models to diagnose their forgetting patterns.
- Architectures engineered for more stable feature geometries might reduce reliance on continual-learning interventions altogether.
Load-bearing premise
The tractable toy model and the Crosscoders accurately capture the feature-encoding transformations that produce forgetting in practical deep networks.
What would settle it
Demonstrating that forgetting occurs in a Vision Transformer on sequential CIFAR-10 without any measurable transformation in the feature encodings identified by Crosscoders would falsify the central claim.
Original abstract
Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features or by disrupting their readout by downstream computations. Analysis of a tractable toy model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We do so through a case study example of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that catastrophic forgetting arises from geometric transformations in the encoding of individual features, which reduce allocated capacity or disrupt downstream readout. It formalizes this view in a tractable toy model (identifying best- and worst-case scenarios and the detrimental role of depth), empirically validates the analysis on the toy model, and demonstrates applicability to practical networks via Crosscoders in a Vision Transformer trained sequentially on CIFAR-10, providing a feature-centric vocabulary for continual learning.
Significance. If the central geometric mechanisms hold and transfer, the work supplies a new mechanistic lens and vocabulary that could unify performance-level observations in continual learning with interpretable feature dynamics. The toy-model formalization and depth analysis are concrete strengths; the Crosscoders case study suggests a path toward analyzing real models, though stronger causal evidence would be needed to elevate impact.
major comments (3)
- [§4] §4 (Toy Model Experiments): the claim that depth is detrimental is supported only within the specific toy-model geometry; no ablation or scaling argument shows why this transfers to the ViT architecture used later, leaving the 'worst-case scenario' identification load-bearing but unlinked to the practical case study.
- [§5] §5 (Crosscoders Case Study on ViT): the transformations identified by Crosscoders are shown to correlate with forgetting, but the manuscript contains no intervention (e.g., targeted editing of the discovered feature directions or capacity metrics) that would establish causality rather than correlation; this directly weakens the central claim that the toy-model mechanisms operate in practical networks.
- [§3] §3 (Geometric Formalization): the definitions of 'capacity reduction' and 'readout disruption' are introduced geometrically but lack explicit, parameter-free metrics that can be computed identically on both the toy model and the ViT activations; without such a transferable quantity, the framework risks being descriptive rather than predictive.
minor comments (2)
- [Figure 3] Figure 3 (toy-model depth plot): axis labels and legend entries are too small for readability; enlarging and adding a brief caption explaining the geometric quantities plotted would improve clarity.
- [§5] The term 'Crosscoders' is used without an inline equation or pseudocode definition in the main text; a short formal description (even if details are in the appendix) would help readers follow the ViT analysis.
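For readers unfamiliar with the term, a crosscoder can be summarized roughly as follows. This is a sketch of the general formulation from the sparse-crosscoder literature, not the paper's own definition; the index $m$ over model checkpoints (for example, before and after training on a new task), the sparsity weight $\lambda$, and all symbol names are assumptions made for illustration.

```latex
% a^{m}(x): activations of checkpoint m on input x (assumed notation)
\begin{align}
  f(x) &= \mathrm{ReLU}\Big(\sum_{m} W_{\mathrm{enc}}^{m}\, a^{m}(x) + b_{\mathrm{enc}}\Big)
       && \text{shared sparse latents} \\
  \hat{a}^{m}(x) &= W_{\mathrm{dec}}^{m}\, f(x) + b_{\mathrm{dec}}^{m}
       && \text{per-checkpoint reconstruction} \\
  \mathcal{L}(x) &= \sum_{m} \big\lVert a^{m}(x) - \hat{a}^{m}(x) \big\rVert_2^2
       + \lambda \sum_{i} f_i(x) \sum_{m} \big\lVert W_{\mathrm{dec},i}^{m} \big\rVert_2
       && \text{reconstruction + sparsity}
\end{align}
```

Under this formulation each latent $i$ has one decoder vector $W_{\mathrm{dec},i}^{m}$ per checkpoint, so comparing those vectors across checkpoints is one way to read off how a feature's encoding changed.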
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, clarifying the intended scope of the toy model and case study while committing to revisions that improve precision and transferability of the framework.
Referee: [§4] §4 (Toy Model Experiments): the claim that depth is detrimental is supported only within the specific toy-model geometry; no ablation or scaling argument shows why this transfers to the ViT architecture used later, leaving the 'worst-case scenario' identification load-bearing but unlinked to the practical case study.
Authors: The toy model is presented as a controlled, analytically tractable setting in which geometric mechanisms can be formalized and worst-case scenarios derived exactly; the ViT experiment is positioned as an application of the resulting vocabulary via Crosscoders rather than a direct empirical transfer test. We agree that the manuscript would benefit from an explicit discussion of the modeling assumptions required for generalization. In the revision we will add a dedicated paragraph in §4 and the discussion section that states the toy-model depth result as illustrative of possible mechanisms, notes the absence of scaling ablations, and outlines conditions under which similar depth effects might appear in transformers. This clarification prevents over-interpretation while preserving the toy model’s role as a formalization tool. revision: partial
Referee: [§5] §5 (Crosscoders Case Study on ViT): the transformations identified by Crosscoders are shown to correlate with forgetting, but the manuscript contains no intervention (e.g., targeted editing of the discovered feature directions or capacity metrics) that would establish causality rather than correlation; this directly weakens the central claim that the toy-model mechanisms operate in practical networks.
Authors: The Crosscoders analysis demonstrates that the geometric transformations predicted by the toy model co-occur with measured forgetting; no causal interventions (feature editing, capacity manipulation) are performed. We will revise §5 and the abstract to describe the result explicitly as correlational evidence that the framework can be applied to real networks, and we will add a short subsection outlining how targeted interventions could be conducted in follow-up work. This adjustment aligns the stated claims with the evidence actually presented. revision: yes
Referee: [§3] §3 (Geometric Formalization): the definitions of 'capacity reduction' and 'readout disruption' are introduced geometrically but lack explicit, parameter-free metrics that can be computed identically on both the toy model and the ViT activations; without such a transferable quantity, the framework risks being descriptive rather than predictive.
Authors: We accept that the geometric definitions in §3 would be strengthened by accompanying, parameter-free metrics usable on both the toy model and Crosscoder-extracted features. In the revision we will define (i) capacity reduction as the fractional decrease in the volume of the convex hull of a feature’s activation vectors across tasks and (ii) readout disruption as the increase in the angle between the feature direction and the linear readout weights. Both quantities are computable from the same activation matrices in the toy model and from the Crosscoder dictionary vectors in the ViT, thereby providing a uniform, predictive bridge between the two settings. revision: yes
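The two metrics proposed in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it restricts activation vectors to 2-D so the convex-hull "volume" is an area that can be computed with the standard library, and the function names (`hull_area`, `capacity_reduction`, `readout_disruption`) and the example data are hypothetical.

```python
import math

def hull_area(points):
    """Area of the 2-D convex hull (monotone chain, then shoelace formula)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            # Pop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(reversed(pts))
    n = len(hull)
    twice_area = sum(hull[i][0] * hull[(i + 1) % n][1]
                     - hull[(i + 1) % n][0] * hull[i][1] for i in range(n))
    return abs(twice_area) / 2.0

def capacity_reduction(old_points, new_points):
    """Fractional decrease in hull area of a feature's activation vectors."""
    old_area = hull_area(old_points)
    if old_area == 0.0:
        raise ValueError("old activations span no area; metric undefined")
    return 1.0 - hull_area(new_points) / old_area

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def readout_disruption(old_dir, new_dir, readout):
    """Increase in the angle between the feature direction and the readout."""
    return angle(new_dir, readout) - angle(old_dir, readout)

# Hypothetical data: activations shrink to half scale; the direction rotates 90 degrees.
old_acts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
new_acts = [(0.5 * x, 0.5 * y) for x, y in old_acts]
print(capacity_reduction(old_acts, new_acts))
print(readout_disruption((1.0, 0.0), (0.0, 1.0), (1.0, 0.0)))
```

For realistic activation dimensions the hull would need a general-purpose routine (for instance `scipy.spatial.ConvexHull`, which exposes a `volume` attribute), and exact hulls become expensive as dimension grows, so a practical version would probably need a projection or an approximation.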
Circularity Check
No circularity: a new geometric framework is formalized in a toy model, then applied via Crosscoders.
The derivation introduces an original mechanistic framework with geometric interpretation of feature encoding transformations, formalizes it via analysis and experiments on a tractable toy model (identifying capacity reduction and readout disruption), and demonstrates application to a ViT via Crosscoders on sequential CIFAR-10. No load-bearing step reduces by construction to fitted parameters, self-citations, or renamed known results; the central claims rest on independent toy-model formalization and case-study analysis rather than tautological redefinition of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the tractable toy model captures the mechanisms of catastrophic forgetting in deeper practical networks.
invented entities (1)
- Crosscoders for feature transformation analysis (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning
A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity a...