Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys

Archie Chaudhury

arxiv: 2606.02860 · v1 · pith:IYRT3JLRnew · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 15:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: catastrophic forgetting · continual learning · model stitching · transport keys · interface drift · latent knowledge

The pith

Apparent catastrophic forgetting often reflects interface drift between network stages rather than erasure of earlier computations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that catastrophic forgetting in sequential training is not primarily the loss of task-relevant features inside a network. Instead, a large share of the performance drop arises from misalignment at the boundaries where early computation hands off to later stages. By estimating compact transport keys from a few paired anchor activations and using them to stitch early layers from the updated model to late layers from the predecessor, the authors recover most of the original task performance. Experiments on split CIFAR-100 with a ResNet-style network and on a compact vision transformer show the same pattern. This matters because it reframes continual learning as a problem of indexing existing computations rather than solely preventing weight change.

Core claim

Across controlled continual-learning settings, a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task-relevant computation. Transport keys, described as compact interface-alignment operators estimated from a small set of paired anchor activations, enable recovery of most Task A performance after training on Task B when early computation from the post-update network is combined with late computation from its predecessor via model stitching.

What carries the argument

Transport keys: compact interface-alignment operators estimated from paired anchor activations and applied through model stitching to realign drifted stages.
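
As a rough illustration of what such an operator could look like, the sketch below fits an affine map by ridge regression on paired anchor activations and applies it to held-out activations. The affine parameterization, the ridge penalty, and the names `fit_transport_key` and `apply_transport_key` are assumptions made for this example; the abstract does not say how the paper's keys are parameterized or optimized. The drift is simulated with a synthetic linear map, not taken from any trained network.

```python
import numpy as np

def fit_transport_key(h_new, h_old, ridge=1e-3):
    """Fit an affine map T so that [h_new, 1] @ T approximates h_old.

    h_new, h_old: (n_anchors, d) activations recorded at the same interface
    for the same inputs, from the updated and predecessor networks.
    (Hypothetical parameterization; the paper's actual key may differ.)
    """
    n = h_new.shape[0]
    X = np.hstack([h_new, np.ones((n, 1))])        # append a bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])       # ridge-regularized normal equations
    return np.linalg.solve(A, X.T @ h_old)         # shape (d + 1, d)

def apply_transport_key(T, h_new):
    """Map drifted activations back into the predecessor's coordinate frame."""
    X = np.hstack([h_new, np.ones((h_new.shape[0], 1))])
    return X @ T

# Synthetic stand-in for interface drift: a well-conditioned linear map plus offset.
rng = np.random.default_rng(0)
d = 8
h_old_all = rng.normal(size=(200, d))
drift = np.eye(d) + 0.1 * rng.normal(size=(d, d))
offset = rng.normal(size=d)
h_new_all = h_old_all @ drift + offset

# Estimate the key from 32 anchors, then check alignment on the remaining inputs.
T = fit_transport_key(h_new_all[:32], h_old_all[:32], ridge=1e-8)
recovered = apply_transport_key(T, h_new_all[32:])
err = float(np.abs(recovered - h_old_all[32:]).max())
print(f"max held-out alignment error: {err:.2e}")
```

In a real stitched evaluation the realigned activations would be fed to the predecessor's late layers and scored on Task A; this toy only checks activation-level alignment.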

If this is right

On split CIFAR-100 with a ResNet-style network, transport keys recover most of the original Task A performance after sequential training on Task B.
A compact vision transformer exhibits a similar recovery pattern when early and late stages are realigned with transport keys.
Continual learning may benefit more from mechanisms that index and re-access latent computations than from methods focused only on preventing weight change.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If interface drift dominates, continual-learning systems could maintain performance by learning or estimating transport keys on the fly without full retraining.
The same stitching approach might be tested on language models to check whether apparent forgetting there is likewise mostly an interface issue.
Varying the selection or number of anchor activations used to estimate each transport key would test how sensitive the recovery is to that choice.
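
The anchor-count sensitivity test in the last point could be prototyped along the following lines. This is a synthetic sketch under stated assumptions: the key is a plain linear map fit by ridge regression, drift is a random near-identity matrix with added noise, and `fit_key`, `sample`, and `sweep` are hypothetical names. It measures held-out activation alignment error, not Task A accuracy.

```python
import numpy as np

def fit_key(h_new, h_old, ridge=1e-3):
    """Ridge least-squares fit of a linear key mapping h_new to h_old (assumed form)."""
    A = h_new.T @ h_new + ridge * np.eye(h_new.shape[1])
    return np.linalg.solve(A, h_new.T @ h_old)

rng = np.random.default_rng(2)
d = 16
drift = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # synthetic interface drift

def sample(n):
    """Draw n paired activations: predecessor frame and noisy drifted frame."""
    h_old = rng.normal(size=(n, d))
    h_new = h_old @ drift + 0.05 * rng.normal(size=(n, d))
    return h_new, h_old

e_new, e_old = sample(512)                           # fixed held-out evaluation set

sweep = {}
for n_anchor in (4, 16, 64, 256):
    a_new, a_old = sample(n_anchor)                  # fresh anchors for each setting
    key = fit_key(a_new, a_old)
    sweep[n_anchor] = float(np.mean((e_new @ key - e_old) ** 2))

for n_anchor, mse in sweep.items():
    print(f"{n_anchor:4d} anchors -> held-out MSE {mse:.4f}")
```

Repeating the sweep over several random seeds and anchor-selection strategies would show whether recovery depends on which anchors are chosen as well as how many.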

Load-bearing premise

Recovery of performance by stitching with a transport key estimated from anchor activations shows that task-relevant computation remains intact rather than the key or stitching protocol itself creating the recovered capability.

What would settle it

An experiment in which early layers are replaced by random weights yet transport keys still produce the same recovery would falsify the claim that the keys are merely restoring access to preserved latent computation.

Figures

Figures reproduced from arXiv: 2606.02860 by Archie Chaudhury.

**Figure 2.** Transport-key evaluation workflow. Paired anchor activa…

**Figure 3.** Split CIFAR-100 recovery via stitching at stage 1. A com…

**Figure 5.** Control analysis at stage 1 for CIFAR-10…

**Figure 6.** Control analysis at stage 2 for CIFAR-10…

Original abstract

Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that supported performance on earlier tasks. We challenge the stronger form of this view. Across controlled continual-learning settings, we find that a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task-relevant computation. We study this phenomenon through a stitched evaluation protocol that combines early computation from a post-update network with late computation from its predecessor, optionally mediated by a compact, task-specific transport key. We describe transport keys at a systems level as compact interface-alignment operators estimated from a small set of paired anchor activations and evaluated through model stitching. On split CIFAR-100 with a ResNet-style network, transport keys recover most of the original Task A performance after sequential training on Task B. On a compact vision transformer, we observe a similar recovery pattern. These results suggest that continual learning may require better mechanisms for indexing and re-accessing latent computations, not only methods that prevent weight change.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims much apparent forgetting is interface drift recoverable by stitching with task-specific transport keys, but the abstract leaves open whether those keys are just injecting the missing task info.

The main observation is that in sequential training on split CIFAR-100, early layers from the updated network plus late layers from the predecessor, connected by a small transport key fit on anchor pairs, restore most of the original task performance. The same pattern appears on a compact vision transformer. This is presented as evidence that the early computation for the old task remains intact and the issue is mainly one of interface alignment.

The stitching protocol itself is a reasonable way to separate stages, and framing transport keys as compact, estimable alignment operators gives a concrete systems-level handle on the idea. If the recovery survives proper controls it would shift attention in continual learning toward indexing mechanisms.

The soft spot is the information content of the keys. They are estimated from paired activations on the same Task A inputs, so nothing in the abstract rules out the possibility that the key learns task-conditioned corrections or partial features rather than a pure drift fix. Without ablations using keys from unrelated tasks, random mappings, or zeroed keys, the recovery could be the key supplying what the late layers lost. The lack of any numbers, anchor counts, error bars, or trivial baselines makes it impossible to gauge how much of the effect is real versus artifact.

This is for people already thinking about mechanistic accounts of forgetting rather than standard regularization or replay. It deserves referee time because the experimental setup is straightforward to check and the reframing is testable, but the authors will need to close the gap on whether the keys are neutral interfaces or active contributors.

Referee Report

3 major / 2 minor

Summary. The paper claims that catastrophic forgetting in continual learning is largely attributable to interface drift between internal stages of a network rather than permanent erasure of task-relevant computations. It introduces 'transport keys' as compact, task-specific interface-alignment operators estimated from small sets of paired anchor activations, and uses a model-stitching protocol (early layers from post-update network + late layers from predecessor, optionally with the key) to recover most original Task A performance after sequential training on Task B. Results are reported on split CIFAR-100 using a ResNet-style network and a compact vision transformer.

Significance. If the central claim holds after rigorous controls, the work would meaningfully reframe continual-learning research away from purely representational accounts of forgetting toward interface management and re-access mechanisms. The stitching protocol itself could become a useful diagnostic tool. The manuscript does not yet supply the quantitative controls or baselines needed to establish this reframing.

major comments (3)

[§3] §3 (transport-key estimation): The key is fitted on paired anchor activations drawn from the same Task A inputs used to evaluate recovery. This setup risks the learned mapping encoding task-conditioned corrections or partial task features rather than performing pure interface alignment; without an ablation that replaces the key with a mapping estimated from shuffled, random, or cross-task activations, it is impossible to separate drift correction from information injection.
[§4] §4 (experimental results): The abstract states that transport keys 'recover most of the original Task A performance,' yet the provided text supplies no numerical values, error bars, number of anchor examples, or comparisons against trivial baselines (e.g., direct use of the predecessor late stages, identity mapping, or random linear probes). These omissions make it impossible to judge whether the reported recovery exceeds what would be expected from the stitching protocol itself.
[§4.1] §4.1 (stitching protocol): The claim that recovered performance demonstrates 'intact latent task-relevant computation' in the post-update early stages assumes the transport key supplies no task-specific computation. This assumption is load-bearing for the central thesis but is not tested by any control that isolates the information content of the key (e.g., zero-shot stitching without a learned key or keys trained on unrelated data).
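
The controls requested in these comments can be sketched on synthetic activations. The example below compares a key fit on correctly paired anchors against a key fit on shuffled pairings, an identity mapping, and a zeroed key. Everything here is an assumption for illustration: the linear key, the orthogonal synthetic drift, and the names `fit_key` and `alignment_error`. It reports activation alignment error only; a real control would report Task A accuracy of the stitched network.

```python
import numpy as np

def fit_key(h_new, h_old, ridge=1e-6):
    """Ridge least-squares fit of a linear key mapping h_new to h_old (assumed form)."""
    A = h_new.T @ h_new + ridge * np.eye(h_new.shape[1])
    return np.linalg.solve(A, h_new.T @ h_old)

def alignment_error(key, h_new, h_old):
    """Mean squared error between realigned and predecessor activations."""
    return float(np.mean((h_new @ key - h_old) ** 2))

rng = np.random.default_rng(1)
d, n_anchor, n_eval = 16, 64, 256
h_old = rng.normal(size=(n_anchor + n_eval, d))
drift, _ = np.linalg.qr(rng.normal(size=(d, d)))    # synthetic orthogonal drift
h_new = h_old @ drift

a_new, a_old = h_new[:n_anchor], h_old[:n_anchor]   # anchors used for fitting
e_new, e_old = h_new[n_anchor:], h_old[n_anchor:]   # disjoint evaluation inputs

controls = {
    "learned": fit_key(a_new, a_old),
    "shuffled": fit_key(a_new, a_old[rng.permutation(n_anchor)]),  # pairing destroyed
    "identity": np.eye(d),                                         # stitching with no key
    "zero": np.zeros((d, d)),                                      # key carries nothing
}
errors = {name: alignment_error(k, e_new, e_old) for name, k in controls.items()}

for name, mse in errors.items():
    print(f"{name:9s} held-out MSE {mse:.4f}")
```

Evaluating on inputs disjoint from the anchors, as done here, also addresses the concern that the key is fit and scored on the same Task A examples.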

minor comments (2)

[§3] Notation for the transport key (size, parameterization, optimization objective) should be formalized with an equation in §3 to allow replication.
[§4] The manuscript should report the exact number of anchor examples used for key estimation and any sensitivity analysis with respect to that hyperparameter.
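
For concreteness, one possible formalization of the kind the first minor comment asks for is given below. It is not taken from the paper: it assumes an affine key $(T, b)$ fit by ridge regression, and the symbols $e_{\mathrm{new}}$ (early stages of the post-update network), $e_{\mathrm{old}}$ and $g_{\mathrm{old}}$ (early and late stages of the predecessor), anchors $x_1, \dots, x_n$, and penalty $\lambda$ are all introduced here for illustration.

```latex
\[
(T^\star, b^\star)
  = \arg\min_{T,\, b} \;
    \sum_{i=1}^{n}
      \bigl\| T\, e_{\mathrm{new}}(x_i) + b - e_{\mathrm{old}}(x_i) \bigr\|_2^2
    + \lambda \, \| T \|_F^2 ,
\qquad
\hat{y}(x) = g_{\mathrm{old}}\!\bigl( T^\star e_{\mathrm{new}}(x) + b^\star \bigr).
\]
```

Under this assumed form, the key has $d^2 + d$ parameters for an interface of width $d$, and the anchor count $n$ is the hyperparameter whose value and sensitivity the second minor comment asks the authors to report.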

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments correctly identify gaps in controls and reporting that limit the strength of the current evidence. We will revise the manuscript to address each point.

Referee: [§3] §3 (transport-key estimation): The key is fitted on paired anchor activations drawn from the same Task A inputs used to evaluate recovery. This setup risks the learned mapping encoding task-conditioned corrections or partial task features rather than performing pure interface alignment; without an ablation that replaces the key with a mapping estimated from shuffled, random, or cross-task activations, it is impossible to separate drift correction from information injection.

Authors: We agree that estimation on Task A activations leaves open the possibility that the key captures task-specific information. In revision we will add the requested ablations: keys estimated from shuffled activations, random pairings, and cross-task activations, and report the resulting stitching performance to isolate the contribution of interface alignment. revision: yes
Referee: [§4] §4 (experimental results): The abstract states that transport keys 'recover most of the original Task A performance,' yet the provided text supplies no numerical values, error bars, number of anchor examples, or comparisons against trivial baselines (e.g., direct use of the predecessor late stages, identity mapping, or random linear probes). These omissions make it impossible to judge whether the reported recovery exceeds what would be expected from the stitching protocol itself.

Authors: We will expand §4 with the missing quantitative details: exact recovery percentages and standard deviations across multiple runs, the precise number of anchor examples, and direct comparisons against the listed baselines (predecessor late stages alone, identity mapping, and random linear probes). revision: yes
Referee: [§4.1] §4.1 (stitching protocol): The claim that recovered performance demonstrates 'intact latent task-relevant computation' in the post-update early stages assumes the transport key supplies no task-specific computation. This assumption is load-bearing for the central thesis but is not tested by any control that isolates the information content of the key (e.g., zero-shot stitching without a learned key or keys trained on unrelated data).

Authors: We will add the suggested controls: zero-shot stitching (no learned key) and transport keys estimated from unrelated tasks or random data. These experiments will quantify how much performance gain is attributable to the key versus the stitching protocol itself. revision: yes
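
The recovery percentages and standard deviations promised in these responses need a definition. One plausible choice, assumed here and not stated in the abstract, is the fraction of the forgetting gap that stitching closes. The function names and the input numbers below are placeholders for illustration only; they are not results from the paper.

```python
import numpy as np

def recovery_fraction(acc_original, acc_forgotten, acc_stitched):
    """Share of the forgetting gap closed by stitching (assumed metric).

    acc_original:  Task A accuracy before training on Task B
    acc_forgotten: Task A accuracy of the updated network on its own
    acc_stitched:  Task A accuracy of the stitched network with a key
    """
    gap = acc_original - acc_forgotten
    if gap <= 0:
        raise ValueError("no forgetting gap to recover")
    return (acc_stitched - acc_forgotten) / gap

def summarize(runs):
    """Mean and sample standard deviation of recovery across independent runs."""
    fracs = np.array([recovery_fraction(*r) for r in runs])
    return float(fracs.mean()), float(fracs.std(ddof=1))

# Placeholder inputs, NOT measurements from the paper.
toy_runs = [(0.80, 0.20, 0.65), (0.80, 0.20, 0.50), (0.80, 0.20, 0.80)]
mean_rec, std_rec = summarize(toy_runs)
print(f"recovery: {mean_rec:.2f} +/- {std_rec:.2f}")
```

Reporting the same quantity for each baseline (identity mapping, random key, no key) would make the comparison requested by the referee direct.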

Circularity Check

0 steps flagged

No circularity: empirical recovery protocol does not reduce to input by construction

The paper presents an empirical protocol using stitched networks and transport keys estimated from anchor activations to demonstrate performance recovery on split CIFAR-100 and vision transformers. No equations, self-citations, or uniqueness theorems are provided in the text that would make the recovery result equivalent to the fitting procedure by definition. The transport key is described as an estimated alignment operator whose effect is measured on task performance; this is a standard fitted-component evaluation rather than a self-definitional or load-bearing self-citation reduction. The central claim remains an interpretation of observed recovery rather than a mathematical identity forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the interpretation that stitching performance equals evidence of preserved latent computation; this is a domain assumption whose validity cannot be checked from the abstract alone. No free parameters or invented entities beyond the transport key itself are described.

axioms (1)

domain assumption: Stitched evaluation with a transport key estimated from anchor activations reveals the presence of task-relevant computation that would otherwise be inaccessible due to interface drift.
This premise is required for the claim that forgetting is not erasure; it is invoked when the authors interpret recovery as evidence against permanent erasure.

invented entities (1)

transport key (no independent evidence)
purpose: compact interface-alignment operator estimated from paired anchor activations
New construct introduced to mediate early and late computation stages; no independent evidence outside the stitching experiments is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5704 in / 1295 out tokens · 12782 ms · 2026-06-28T15:27:05.667947+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 1 canonical work pages · 1 internal anchor

[1] [Ainsworth et al., 2023] Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In International Conference on Learning Representations (ICLR), 2023.

[2] [Aljundi et al., 2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.

[3] [Bansal et al., 2021] Yamini Bansal, Gal Kaplun, Jing Yang, Preetum Raghavan, Shreya Medipally, Doina Precup, and Irina Rish. Revisiting model stitching to compare neural representations. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 225–236, 2021.

[4] [Chaudhry et al., 2019] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. In International Conference on Learning Representations (ICLR), 2019.

[5] [Entezari et al., 2022] Rahim Entezari, Hanie Sedghi, Olga Saukh, and Neyshabur Behnam. The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations (ICLR), 2022.

[6] [French, 1999] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

[7] [Houlsby et al., 2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Brunslo, Anna Degtyareva, Gal Kaplyn, et al. Parameter-efficient transfer learning for nlp. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

[8] [Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Pashapi Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. In Proceedings of the National Academy of Sciences (PNAS), volume 114, pages 3521–3526, 2017.

[9] [Lenc and Vedaldi, 2015] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 991–999, 2015.

[10] [Lopez-Paz and Ranzato, 2017] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

[11] [Mallya and Lazebnik, 2018] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7765–7773, 2018.

[12] [McCloskey and Cohen, 1989] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Volume 24 of Psychology of Learning and Motivation, pages 109–165. Academic Press, 1989.

[13] [Rebuffi et al., 2017] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2001–2010, 2017.

[14] [Rolnick et al., 2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Wayne, David Saxton, Timothy Lillicrap, and Greg Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

[15] [Rusu et al., 2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Soyer Hubert, Kirkpatrick James, Kavukcuoglu Koray, Pascanu Razvan, and Hadsell Raia. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[16] [Wang et al., 2024] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application, 2024.

[17] [Zenke et al., 2017] Friedemann Zenke, Ben Poole, and Ganguli Surya. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3987–3995, 2017.
