DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning

Bingchen Huang; Yifu Chen; Yuanchao Du; Zhiling Wang

arxiv: 2607.01630 · v1 · pith:TAYJL4ATnew · submitted 2026-07-02 · 💻 cs.CV

DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning

Bingchen Huang , Yifu Chen , Zhiling Wang , Yuanchao Du This is my paper

Pith reviewed 2026-07-03 16:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords class-incremental learningvision transformerdynamic networkmasked image modelingfrom-scratch learningtask token expansionrepresentation decoupling

0 comments

The pith

Continuous masked image modeling on the backbone plus hierarchical task token expansion lets Vision Transformers learn new classes from scratch while retaining old ones better.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that classification supervision alone fails to keep shared backbone features intact across many incremental tasks when starting from random weights, leading to biased boundaries and capped long-term accuracy. It proposes separating the preservation of general visual structure from task-specific discrimination through two orthogonal mechanisms. If correct, this would mean ViTs can handle longer sequences of new classes without external pretraining by maintaining task-agnostic representations that support discrimination as tasks accumulate.

Core claim

DRDN continuously applies masked image modeling at every incremental step with reconstruction gradients routed exclusively through the backbone to retain general visual structure, while employing hierarchical task token expansion across all transformer layers with a modified per-task attention rule to reduce inter-task interference; this yields higher average accuracy than token-expansion baselines on from-scratch ViT CIL benchmarks, with the gap widening on longer sequences.

What carries the argument

The decoupled representation mechanism that routes masked image modeling reconstruction gradients exclusively through the backbone while expanding task-specific tokens hierarchically across layers.

If this is right

On CIFAR100-B0 with 10 steps, average accuracy reaches 77.19 percent, exceeding DKT by 1.36 points and DyTox by 3.53 points.
The performance edge increases as the number of incremental steps grows.
Multi-seed runs show stability within plus or minus 0.31 percent.
The MIM decoder adds parameters and computation only during training, leaving inference unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of reconstruction and classification objectives could apply to other incremental settings where shared features degrade over time.
Hierarchical token expansion might reduce interference in non-ViT architectures if adapted similarly.
Longer task sequences would likely amplify the benefit, suggesting the method scales with problem size.

Load-bearing premise

Continuously applying masked image modeling with gradients routed only through the backbone will preserve task-agnostic shared representations sufficiently to improve discriminability as tasks accumulate.

What would settle it

A controlled ablation removing the masked image modeling component on CIFAR100-B0 with 20 steps, where the accuracy advantage over DKT and DyTox vanishes or reverses, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2607.01630 by Bingchen Huang, Yifu Chen, Yuanchao Du, Zhiling Wang.

**Figure 1.** Figure 1: t-SNE visualization of DyTox task-token features on CIFAR100-B0 (10 steps) after all 10 tasks are trained. Large markers show per-class centroids; [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Grad-CAM visualization of shallow-layer activations. Models trained [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: CIL performance on CIFAR100 (10-step, B0). For each task, 10 new [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 4.** Figure 4: DRDN framework. The backbone consists of multiple Modified Self-Attention Blocks (MSABs) and branches into two paths. The upper (classification) [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Modified Self-Attention Block (MSAB). Left (task-specific path): each [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Average accuracy performance evolution on CIFAR100-B50 (2, 5, 10 steps), the harder large-base setting. Starting from a 50-class base, DRDN (ours, [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Average accuracy performance evolution on CIFAR100-B0 (5, 10, 20 steps). DRDN (blue) exhibits consistently slower accuracy degradation than [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

read the original abstract

Dynamic expansion methods for class-incremental learning (CIL) protect task-specific knowledge by growing dedicated tokens or subnetworks, yet our analyses suggest that classification supervision alone does not sufficiently preserve task-agnostic shared backbone representations over long incremental sequences. We identify two intertwined challenges: cross-task confusion from sequential training on predominantly current-task data, which biases decision boundaries toward recent tasks; and under-optimized shared representations in the backbone that cap long-term discriminability as tasks accumulate. We propose the Decoupled Representation Dynamic Network (DRDN), which addresses these challenges via two orthogonal mechanisms. For shared backbone representations, DRDN continuously applies masked image modeling (MIM) at every incremental step, with reconstruction gradients routed exclusively through the backbone, encouraging it to retain general visual structure beyond class-discriminative cues. For task-specific discrimination, DRDN employs hierarchical task token expansion across all transformer layers, with a modified per-task attention rule that reduces inter-task interference. We support this design with accuracy degradation analysis and cross-task confusion rate measurements. In the from-scratch ViT CIL setting (no external pretraining), DRDN consistently improves over strong token-expansion baselines with comparable backbone scale. On CIFAR100-B0 (10 steps), DRDN achieves 77.19% average accuracy, outperforming DKT by 1.36 points and DyTox by 3.53 points, with an advantage that grows at longer incremental sequences. Multi-seed validation confirms stability (+/-0.31%). The MIM decoder is active only during training, adding no inference-time parameters or computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DRDN's accuracy gains over token-expansion baselines look real on the reported numbers but cannot be cleanly attributed to the MIM decoupling because the modified attention rule is not ablated separately.

read the letter

The paper's core move is to run masked image modeling at every incremental step with gradients routed only through the backbone, while also growing hierarchical task tokens across layers and changing the per-task attention rule to cut interference. On CIFAR100-B0 with 10 steps it reports 77.19% average accuracy, beating DKT by 1.36 points and DyTox by 3.53 points, and the gap widens on longer sequences. Multi-seed runs give a small standard deviation. They also supply accuracy-degradation curves and cross-task confusion rates, which is more analysis than many incremental-learning papers include.

The from-scratch ViT setting is useful because most prior work leans on ImageNet pretraining. The design choice to keep the MIM decoder training-only is clean and adds no inference cost.

The soft spot is the missing isolation the stress-test flags. The abstract and reported results do not describe an ablation that keeps the hierarchical tokens and modified attention but turns off the MIM loss. Without that control it is hard to know whether the 1-3 point lift comes from better shared representations or from the attention tweak. The gains are modest enough that this matters. The paper does not appear to overclaim statistical significance or hide implementation details in the abstract, but the attribution step is incomplete.

This is for people already working on class-incremental ViTs. It engages the token-expansion literature directly and adds a concrete protocol. It is coherent on its own terms and shows clear thinking about the shared-representation problem, so it deserves a serious referee. I would send it out and ask specifically for the MIM-only ablation in revision.

Referee Report

1 major / 2 minor

Summary. The paper claims that in from-scratch ViT class-incremental learning, standard classification supervision fails to preserve task-agnostic shared backbone representations, causing cross-task confusion and degraded long-term discriminability. It proposes DRDN, which applies continuous masked image modeling (MIM) at each incremental step with reconstruction gradients routed only through the backbone, combined with hierarchical task token expansion across transformer layers and a modified per-task attention rule to reduce interference. Supported by accuracy degradation and cross-task confusion analyses, DRDN reports 77.19% average accuracy on CIFAR100-B0 (10 steps), outperforming DKT by 1.36 points and DyTox by 3.53 points, with larger gains on longer sequences, multi-seed stability, and no added inference cost.

Significance. If the central attribution holds, the work would be significant for from-scratch ViT CIL by demonstrating that explicit MIM-based preservation of shared representations can complement token-expansion methods and improve scaling to longer task sequences. The zero-inference-overhead design and multi-seed validation (+/-0.31%) are concrete strengths. However, the significance is limited by the absence of component-isolation experiments, which leaves open whether the reported gains stem from the MIM mechanism, the attention modification, or their interaction.

major comments (1)

[Abstract / method description] Abstract and method description: The central claim attributes the 1.36-point and 3.53-point gains (and the growing advantage on longer sequences) to continuous MIM preserving task-agnostic backbone representations. However, DRDN also introduces a modified per-task attention rule within the hierarchical token expansion. No ablation is described that disables only the MIM loss while retaining the modified attention and token expansion, so the improvements cannot be securely attributed to the decoupled-representation mechanism rather than the attention change.

minor comments (2)

[Abstract] Abstract: The supporting analyses (accuracy degradation and cross-task confusion rates) are mentioned but not quantified or compared to baselines; explicit numbers, tables, or statistical significance tests for these diagnostics would strengthen the evidence.
[Abstract] Abstract: Experimental details are absent (baseline re-implementation, fixed vs. tuned hyperparameters, data splits, number of runs beyond the multi-seed note). These are needed to assess whether the 1.36/3.53-point margins are robust.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need to more rigorously isolate the contribution of the continuous MIM mechanism from the modified per-task attention rule. We agree that the current experiments do not fully separate these components and will add the requested ablation in the revision to strengthen attribution of the reported gains.

read point-by-point responses

Referee: [Abstract / method description] Abstract and method description: The central claim attributes the 1.36-point and 3.53-point gains (and the growing advantage on longer sequences) to continuous MIM preserving task-agnostic backbone representations. However, DRDN also introduces a modified per-task attention rule within the hierarchical token expansion. No ablation is described that disables only the MIM loss while retaining the modified attention and token expansion, so the improvements cannot be securely attributed to the decoupled-representation mechanism rather than the attention change.

Authors: We acknowledge that the manuscript does not include an ablation that removes only the MIM loss while retaining the hierarchical token expansion and modified per-task attention rule. The modified attention rule is introduced specifically to reduce inter-task interference within the expanded token structure, but the central hypothesis concerns the MIM-driven preservation of shared backbone representations. In the revised version we will add an ablation study that trains the full DRDN architecture (including token expansion and modified attention) but disables the MIM reconstruction loss. This will directly compare performance with and without MIM under otherwise identical conditions, allowing clearer attribution of the gains on CIFAR100-B0 and longer sequences. We will also report the corresponding cross-task confusion rates for this variant. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical method or reported results

full rationale

The paper presents an empirical architecture (DRDN) for class-incremental learning and reports benchmark accuracies (e.g., 77.19% on CIFAR100-B0) without any equations, derivations, or parameter-fitting steps that reduce a claimed prediction to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core claims; the design choices are motivated by analysis of cross-task confusion and then validated experimentally. The absence of any load-bearing mathematical reduction or fitted-input-as-prediction pattern makes the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone; the method description is high-level and does not expose explicit fitted constants or new postulated objects.

pith-pipeline@v0.9.1-grok · 5830 in / 1254 out tokens · 25728 ms · 2026-07-03T16:56:37.996520+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Catastrophic interference in connec- tionist networks: The sequential learning problem,

M. McCloskey and N. J. Cohen, “Catastrophic interference in connec- tionist networks: The sequential learning problem,” inPsychology of Learning and Motivation. Elsevier, 1989, vol. 24, pp. 109–165

1989
[2]

An empirical investigation of catastrophic forgetting in gradient-based neural networks,

I. J. Goodfellow, M. Mirza, A. Courville, and Y . Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” inICLR Workshop, 2014

2014
[3]

Der: Dynamically expandable representation for class incremental learning,

S. Yan, J. Xie, and X. He, “Der: Dynamically expandable representation for class incremental learning,” inCVPR, 2021, pp. 3014–3023

2021
[4]

Dytox: Transformers for continual learning with dynamic token expansion,

A. Douillard, A. Ram ´e, G. Couairon, and M. Cord, “Dytox: Transformers for continual learning with dynamic token expansion,” inCVPR, 2022, pp. 9285–9295

2022
[5]

Dkt: Diverse knowledge transfer transformer for class incremental learning,

X. Gao, Y . He, S. Dong, J. Cheng, X. Wei, and Y . Gong, “Dkt: Diverse knowledge transfer transformer for class incremental learning,” inCVPR, 2023, pp. 24 236–24 245

2023
[6]

Resolving task confu- sion in dynamic expansion architectures for class incremental learning,

B. Huang, Z. Chen, P. Zhou, J. Chen, and Z. Wu, “Resolving task confu- sion in dynamic expansion architectures for class incremental learning,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 908–916

2023
[7]

BEEF: Bi-compatible class-incremental learning via energy- based expansion and fusion,

F.-Y . Wang, D.-W. Zhou, L. Liu, H.-J. Ye, Y . Bian, D.-C. Zhan, and P. Zhao, “BEEF: Bi-compatible class-incremental learning via energy- based expansion and fusion,” inICLR, 2023

2023
[8]

Semantic relatedness emerges in deep convolutional neural networks designed for object recognition,

T. Huang, Z. Zhen, and J. Liu, “Semantic relatedness emerges in deep convolutional neural networks designed for object recognition,”Frontiers in Computational Neuroscience, vol. 15, p. 625804, 2021

2021
[9]

When prompt-based incremental learning does not meet strong pretraining,

Y .-M. Tang, Y .-X. Peng, and W.-S. Zheng, “When prompt-based incremental learning does not meet strong pretraining,” inICCV, 2023, pp. 1706–1716

2023
[10]

Progressive Neural Networks

A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,”arXiv preprint arXiv:1606.04671, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Lifelong Learning with Dynamically Expandable Networks

J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,”arXiv preprint arXiv:1708.01547, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

Reinforced continual learning,

J. Xu and Z. Zhu, “Reinforced continual learning,”NeurIPS, vol. 31, 2018

2018
[13]

Dense network expansion for class incremental learning,

Z. Hu, Y . Li, J. Lyu, D. Gao, and N. Vasconcelos, “Dense network expansion for class incremental learning,” inCVPR, 2023, pp. 11 858– 11 867

2023
[14]

Loss decoupling for task-agnostic continual learning,

Y .-S. Liang and W.-J. Li, “Loss decoupling for task-agnostic continual learning,” inNeurIPS, vol. 36, 2023, pp. 11 151–11 167

2023
[15]

On the stability-plasticity dilemma of class- incremental learning,

D. Kim and B. Han, “On the stability-plasticity dilemma of class- incremental learning,” inCVPR, 2023, pp. 20 196–20 205

2023
[16]

Masked autoencoders are efficient class incremental learners,

J.-T. Zhai, X. Liu, J. van de Weijer, and M.-M. Cheng, “Masked autoencoders are efficient class incremental learners,” inICCV, 2023

2023
[17]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inCVPR, 2022, pp. 16 000– 16 009

2022
[18]

BEiT: BERT Pre-Training of Image Transformers

H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,”arXiv preprint arXiv:2106.08254, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Image bert pre-training with online tokenizer,

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “Image bert pre-training with online tokenizer,” inICLR, 2022

2022
[20]

Convnext v2: Co-designing and scaling convnets with masked autoencoders,

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoencoders,” inCVPR, 2023, pp. 16 133–16 142

2023
[21]

icarl: Incremental classifier and representation learning,

S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” inCVPR, 2017, pp. 2001–2010. PREPRINT — UNDER REVIEW 10

2017
[22]

Rainbow memory: Continual learning with a memory of diverse samples,

J. Bang, H. Kim, Y . Yoo, J.-W. Ha, and J. Choi, “Rainbow memory: Continual learning with a memory of diverse samples,” inCVPR, 2021, pp. 8218–8227

2021
[23]

Gradient based sample selection for online continual learning,

R. Aljundi, M. Lin, B. Goujaud, and Y . Bengio, “Gradient based sample selection for online continual learning,”NeurIPS, vol. 32, 2019

2019
[24]

Large scale incremental learning,

Y . Wu, Y . Chen, L. Wang, Y . Ye, Z. Liu, Y . Guo, and Y . Fu, “Large scale incremental learning,” inCVPR, 2019, pp. 374–382

2019
[25]

Maintaining discrimination and fairness in class incremental learning,

B. Zhao, X. Xiao, G. Gan, B. Zhang, and S.-T. Xia, “Maintaining discrimination and fairness in class incremental learning,” inCVPR, 2020, pp. 13 208–13 217

2020
[26]

Continual learning with deep generative replay,

H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep generative replay,” vol. 30, 2017

2017
[27]

Class-incremental learning using diffusion model for distillation and replay,

Q. Jodelet, X. Liu, Y . J. Phua, and T. Murata, “Class-incremental learning using diffusion model for distillation and replay,” inICCV, 2023, pp. 3425–3433

2023
[28]

Ddgr: Continual learning with deep diffusion-based generative replay,

R. Gao and W. Liu, “Ddgr: Continual learning with deep diffusion-based generative replay,” inICML, 2023, pp. 10 744–10 763

2023
[29]

Overcoming catastrophic forgetting in neural networks,

J. Kirkpatricket al., “Overcoming catastrophic forgetting in neural networks,”Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017

2017
[30]

Continual learning through synaptic intelligence,

F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” inICML, 2017, pp. 3987–3995

2017
[31]

Memory aware synapses: Learning what (not) to forget,

R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” inECCV, 2018, pp. 139–154

2018
[32]

Riemannian walk for incremental learning,

A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr, “Riemannian walk for incremental learning,” inECCV, 2018, pp. 532–547

2018
[33]

Learning to prompt for continual learning,

Z. Wang, Z. Zhang, C.-Y . Lee, H. Zhang, R. Sun, X. Ren, G. Su, V . Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in CVPR, 2022, pp. 139–149

2022
[34]

Dualprompt: Complementary prompting for rehearsal-free continual learning,

Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y . Lee, X. Ren, G. Su, V . Perot, J. Dyet al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” inECCV, 2022, pp. 631–648

2022
[35]

Self-supervised models are continual learners,

E. Fini, V . G. T. da Costa, X. Alameda-Pineda, E. Ricci, K. Alahari, and J. Mairal, “Self-supervised models are continual learners,” inCVPR, 2022, pp. 9621–9630

2022
[36]

Lump: A framework for continual learning with large pretrained models,

W. Sun, Q. Li, H. Zhang, Y . Li, and S. Liu, “Lump: A framework for continual learning with large pretrained models,” inICLR, 2024

2024
[37]

Scale: Online self- supervised lifelong learning without prior knowledge,

J.-Q. Yu, Z.-Q. Chen, Y .-X. Mu, and J.-H. Li, “Scale: Online self- supervised lifelong learning without prior knowledge,” inCVPR, 2023, pp. 19 090–19 099

2023
[38]

Learning multiple layers of features from tiny images,

A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” 2009

2009
[39]

Imagenet large scale visual recognition challenge,

O. Russakovskyet al., “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015

2015
[40]

Convit: Improving vision transformers with soft convolutional inductive biases,

S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” inICML, 2021, pp. 2286–2296

2021
[41]

Foster: Feature boosting and compression for class-incremental learning,

F.-Y . Wang, D.-W. Zhou, H.-J. Ye, and D.-C. Zhan, “Foster: Feature boosting and compression for class-incremental learning,” inECCV, 2022, pp. 398–414

2022
[42]

Adaptive aggregation networks for class-incremental learning,

Y . Liu, B. Schiele, and Q. Sun, “Adaptive aggregation networks for class-incremental learning,” inCVPR, 2021, pp. 2544–2553

2021
[43]

Memo: A unified framework for exemplar-free class-incremental learning,

D.-W. Zhou, Q.-W. Wang, Z.-H. Qi, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Memo: A unified framework for exemplar-free class-incremental learning,” inICLR, 2023

2023
[44]

Learning a unified classifier incrementally via rebalancing,

S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, “Learning a unified classifier incrementally via rebalancing,” inCVPR, 2019, pp. 831–839

2019
[45]

Podnet: Pooled outputs distillation for small-tasks incremental learning,

A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” inECCV, 2020, pp. 86–102

2020
[46]

Overcoming catastrophic forgetting in incremental object detection via elastic response distillation,

T. Feng, M. Wang, and H. Yuan, “Overcoming catastrophic forgetting in incremental object detection via elastic response distillation,” inCVPR, 2022, pp. 9427–9436

2022
[47]

Endpoints weight fusion for class incremental semantic segmentation,

J.-W. Xiao, C.-B. Zhang, J. Feng, X. Liu, J. van de Weijer, and M.- M. Cheng, “Endpoints weight fusion for class incremental semantic segmentation,” inCVPR, 2023, pp. 7204–7213

2023

[1] [1]

Catastrophic interference in connec- tionist networks: The sequential learning problem,

M. McCloskey and N. J. Cohen, “Catastrophic interference in connec- tionist networks: The sequential learning problem,” inPsychology of Learning and Motivation. Elsevier, 1989, vol. 24, pp. 109–165

1989

[2] [2]

An empirical investigation of catastrophic forgetting in gradient-based neural networks,

I. J. Goodfellow, M. Mirza, A. Courville, and Y . Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” inICLR Workshop, 2014

2014

[3] [3]

Der: Dynamically expandable representation for class incremental learning,

S. Yan, J. Xie, and X. He, “Der: Dynamically expandable representation for class incremental learning,” inCVPR, 2021, pp. 3014–3023

2021

[4] [4]

Dytox: Transformers for continual learning with dynamic token expansion,

A. Douillard, A. Ram ´e, G. Couairon, and M. Cord, “Dytox: Transformers for continual learning with dynamic token expansion,” inCVPR, 2022, pp. 9285–9295

2022

[5] [5]

Dkt: Diverse knowledge transfer transformer for class incremental learning,

X. Gao, Y . He, S. Dong, J. Cheng, X. Wei, and Y . Gong, “Dkt: Diverse knowledge transfer transformer for class incremental learning,” inCVPR, 2023, pp. 24 236–24 245

2023

[6] [6]

Resolving task confu- sion in dynamic expansion architectures for class incremental learning,

B. Huang, Z. Chen, P. Zhou, J. Chen, and Z. Wu, “Resolving task confu- sion in dynamic expansion architectures for class incremental learning,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 908–916

2023

[7] [7]

BEEF: Bi-compatible class-incremental learning via energy- based expansion and fusion,

F.-Y . Wang, D.-W. Zhou, L. Liu, H.-J. Ye, Y . Bian, D.-C. Zhan, and P. Zhao, “BEEF: Bi-compatible class-incremental learning via energy- based expansion and fusion,” inICLR, 2023

2023

[8] [8]

Semantic relatedness emerges in deep convolutional neural networks designed for object recognition,

T. Huang, Z. Zhen, and J. Liu, “Semantic relatedness emerges in deep convolutional neural networks designed for object recognition,”Frontiers in Computational Neuroscience, vol. 15, p. 625804, 2021

2021

[9] [9]

When prompt-based incremental learning does not meet strong pretraining,

Y .-M. Tang, Y .-X. Peng, and W.-S. Zheng, “When prompt-based incremental learning does not meet strong pretraining,” inICCV, 2023, pp. 1706–1716

2023

[10] [10]

Progressive Neural Networks

A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,”arXiv preprint arXiv:1606.04671, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Lifelong Learning with Dynamically Expandable Networks

J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,”arXiv preprint arXiv:1708.01547, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[12] [12]

Reinforced continual learning,

J. Xu and Z. Zhu, “Reinforced continual learning,”NeurIPS, vol. 31, 2018

2018

[13] [13]

Dense network expansion for class incremental learning,

Z. Hu, Y . Li, J. Lyu, D. Gao, and N. Vasconcelos, “Dense network expansion for class incremental learning,” inCVPR, 2023, pp. 11 858– 11 867

2023

[14] [14]

Loss decoupling for task-agnostic continual learning,

Y .-S. Liang and W.-J. Li, “Loss decoupling for task-agnostic continual learning,” inNeurIPS, vol. 36, 2023, pp. 11 151–11 167

2023

[15] [15]

On the stability-plasticity dilemma of class- incremental learning,

D. Kim and B. Han, “On the stability-plasticity dilemma of class- incremental learning,” inCVPR, 2023, pp. 20 196–20 205

2023

[16] [16]

Masked autoencoders are efficient class incremental learners,

J.-T. Zhai, X. Liu, J. van de Weijer, and M.-M. Cheng, “Masked autoencoders are efficient class incremental learners,” inICCV, 2023

2023

[17] [17]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inCVPR, 2022, pp. 16 000– 16 009

2022

[18] [18]

BEiT: BERT Pre-Training of Image Transformers

H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,”arXiv preprint arXiv:2106.08254, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19] [19]

Image bert pre-training with online tokenizer,

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “Image bert pre-training with online tokenizer,” inICLR, 2022

2022

[20] [20]

Convnext v2: Co-designing and scaling convnets with masked autoencoders,

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoencoders,” inCVPR, 2023, pp. 16 133–16 142

2023

[21] [21]

icarl: Incremental classifier and representation learning,

S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” inCVPR, 2017, pp. 2001–2010. PREPRINT — UNDER REVIEW 10

2017

[22] [22]

Rainbow memory: Continual learning with a memory of diverse samples,

J. Bang, H. Kim, Y . Yoo, J.-W. Ha, and J. Choi, “Rainbow memory: Continual learning with a memory of diverse samples,” inCVPR, 2021, pp. 8218–8227

2021

[23] [23]

Gradient based sample selection for online continual learning,

R. Aljundi, M. Lin, B. Goujaud, and Y . Bengio, “Gradient based sample selection for online continual learning,”NeurIPS, vol. 32, 2019

2019

[24] [24]

Large scale incremental learning,

Y . Wu, Y . Chen, L. Wang, Y . Ye, Z. Liu, Y . Guo, and Y . Fu, “Large scale incremental learning,” inCVPR, 2019, pp. 374–382

2019

[25] [25]

Maintaining discrimination and fairness in class incremental learning,

B. Zhao, X. Xiao, G. Gan, B. Zhang, and S.-T. Xia, “Maintaining discrimination and fairness in class incremental learning,” inCVPR, 2020, pp. 13 208–13 217

2020

[26] [26]

Continual learning with deep generative replay,

H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep generative replay,” vol. 30, 2017

2017

[27] [27]

Class-incremental learning using diffusion model for distillation and replay,

Q. Jodelet, X. Liu, Y . J. Phua, and T. Murata, “Class-incremental learning using diffusion model for distillation and replay,” inICCV, 2023, pp. 3425–3433

2023

[28] [28]

Ddgr: Continual learning with deep diffusion-based generative replay,

R. Gao and W. Liu, “Ddgr: Continual learning with deep diffusion-based generative replay,” inICML, 2023, pp. 10 744–10 763

2023

[29] [29]

Overcoming catastrophic forgetting in neural networks,

J. Kirkpatricket al., “Overcoming catastrophic forgetting in neural networks,”Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017

2017

[30] [30]

Continual learning through synaptic intelligence,

F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” inICML, 2017, pp. 3987–3995

2017

[31] [31]

Memory aware synapses: Learning what (not) to forget,

R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” inECCV, 2018, pp. 139–154

2018

[32] [32]

Riemannian walk for incremental learning,

A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr, “Riemannian walk for incremental learning,” inECCV, 2018, pp. 532–547

2018

[33] [33]

Learning to prompt for continual learning,

Z. Wang, Z. Zhang, C.-Y . Lee, H. Zhang, R. Sun, X. Ren, G. Su, V . Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in CVPR, 2022, pp. 139–149

2022

[34] [34]

Dualprompt: Complementary prompting for rehearsal-free continual learning,

Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y . Lee, X. Ren, G. Su, V . Perot, J. Dyet al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” inECCV, 2022, pp. 631–648

2022

[35] [35]

Self-supervised models are continual learners,

E. Fini, V . G. T. da Costa, X. Alameda-Pineda, E. Ricci, K. Alahari, and J. Mairal, “Self-supervised models are continual learners,” inCVPR, 2022, pp. 9621–9630

2022

[36] [36]

Lump: A framework for continual learning with large pretrained models,

W. Sun, Q. Li, H. Zhang, Y . Li, and S. Liu, “Lump: A framework for continual learning with large pretrained models,” inICLR, 2024

2024

[37] [37]

Scale: Online self- supervised lifelong learning without prior knowledge,

J.-Q. Yu, Z.-Q. Chen, Y .-X. Mu, and J.-H. Li, “Scale: Online self- supervised lifelong learning without prior knowledge,” inCVPR, 2023, pp. 19 090–19 099

2023

[38] [38]

Learning multiple layers of features from tiny images,

A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” 2009

2009

[39] [39]

Imagenet large scale visual recognition challenge,

O. Russakovskyet al., “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015

2015

[40] [40]

Convit: Improving vision transformers with soft convolutional inductive biases,

S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” inICML, 2021, pp. 2286–2296

2021

[41] [41]

Foster: Feature boosting and compression for class-incremental learning,

F.-Y . Wang, D.-W. Zhou, H.-J. Ye, and D.-C. Zhan, “Foster: Feature boosting and compression for class-incremental learning,” inECCV, 2022, pp. 398–414

2022

[42] [42]

Adaptive aggregation networks for class-incremental learning,

Y . Liu, B. Schiele, and Q. Sun, “Adaptive aggregation networks for class-incremental learning,” inCVPR, 2021, pp. 2544–2553

2021

[43] [43]

Memo: A unified framework for exemplar-free class-incremental learning,

D.-W. Zhou, Q.-W. Wang, Z.-H. Qi, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Memo: A unified framework for exemplar-free class-incremental learning,” inICLR, 2023

2023

[44] [44]

Learning a unified classifier incrementally via rebalancing,

S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, “Learning a unified classifier incrementally via rebalancing,” inCVPR, 2019, pp. 831–839

2019

[45] [45]

Podnet: Pooled outputs distillation for small-tasks incremental learning,

A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” inECCV, 2020, pp. 86–102

2020

[46] [46]

Overcoming catastrophic forgetting in incremental object detection via elastic response distillation,

T. Feng, M. Wang, and H. Yuan, “Overcoming catastrophic forgetting in incremental object detection via elastic response distillation,” inCVPR, 2022, pp. 9427–9436

2022

[47] [47]

Endpoints weight fusion for class incremental semantic segmentation,

J.-W. Xiao, C.-B. Zhang, J. Feng, X. Liu, J. van de Weijer, and M.- M. Cheng, “Endpoints weight fusion for class incremental semantic segmentation,” inCVPR, 2023, pp. 7204–7213

2023