Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

Chongyang Zhao; Dong Gong; Haodong Lu; Kristen Moore; Lina Yao; Minhui Xue

arxiv: 2506.21035 · v5 · pith:4A4WGOC5new · submitted 2025-06-26 · 💻 cs.LG

Haodong Lu, Chongyang Zhao, Minhui Xue, Lina Yao, Kristen Moore, Dong Gong

Pith reviewed 2026-05-22 13:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learning · mixture of experts · LoRA · rank-1 adapters · associative memory · catastrophic forgetting · parameter-efficient fine-tuning · large language models

The pith

Rank-1 adapters act as self-activating associative memories for continual learning without explicit routers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that continual learning in large pre-trained models can be accomplished by incrementally adding atomic rank-1 adapters that function as fine-grained experts and associative memory units. It identifies problems with coarser experts in existing LoRA-MoE methods, including redundancy, interference, and routing degradation. By grounding the approach in weight matrices as linear associative memories, each rank-1 adapter is treated as a key-value pair that self-evaluates relevance for activation. This turns the process into content-addressable retrieval over accumulated memory, leading to better plasticity-stability balance and less forgetting as shown in experiments with CLIP and large language models.

Core claim

MoRAM achieves continual learning as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory.
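To make the self-activation idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `self_activated_update`, the `tau` and `budget` parameters, the column-vector convention, and the choice to weight each atom's rank-1 output by a temperature softmax over its own key score are all assumed for this example.

```python
import math

def self_activated_update(x, keys, values, tau=1.0, budget=None):
    """Router-free mixture of rank-1 atoms (illustrative assumption only).

    x:      input vector, length d_in
    keys:   list of key vectors k_i, each of length d_in
    values: list of value vectors v_i, each of length d_out
    """
    # Each atom scores the input with its own key: s_i = k_i . x
    scores = [sum(kj * xj for kj, xj in zip(k, x)) for k in keys]

    # Optional activation budget: keep only the highest-scoring atoms.
    active = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    if budget is not None:
        active = active[:budget]

    # Temperature softmax over the active scores (max-shifted for stability).
    top = max(scores[i] for i in active)
    exps = {i: math.exp((scores[i] - top) / tau) for i in active}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}

    # Weighted sum of rank-1 outputs: sum_i w_i * (k_i . x) * v_i
    d_out = len(values[0])
    out = [0.0] * d_out
    for i in active:
        for j in range(d_out):
            out[j] += weights[i] * scores[i] * values[i][j]
    return out, weights
```

No separate router network appears anywhere in this sketch: the gate for each atom is computed from the same key that defines its rank-1 update, which is the sense in which retrieval is content-addressable.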

What carries the argument

Mixture of Rank-1 Associative Memory (MoRAM) where rank-1 adapters serve as independent key-value memory pairs that self-activate for incremental capacity expansion in continual learning.

Load-bearing premise

The assumption that weight matrices function as linear associative memories, allowing rank-1 adapters to operate as independent memory atoms without causing redundancy or interference.
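The premise can be checked on a toy example. The sketch below stores key-value pairs as a sum of outer products and reads them back; the helper names and the toy vectors are hypothetical. With orthonormal keys recall is exact, while overlapping keys produce crosstalk, which is the kind of interference the premise has to rule out.

```python
def outer_sum_memory(keys, values):
    """Store pairs as W = sum_i v_i k_i^T (a correlation-matrix memory)."""
    d_out, d_in = len(values[0]), len(keys[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        for r in range(d_out):
            for c in range(d_in):
                W[r][c] += v[r] * k[c]
    return W

def recall(W, x):
    """Read out the matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

values = [[2.0, 5.0], [-1.0, 3.0]]

# Orthonormal keys: querying with k_0 returns v_0 exactly.
keys = [[1.0, 0.0], [0.0, 1.0]]
exact = recall(outer_sum_memory(keys, values), keys[0])

# Overlapping unit-norm keys (k_0 . k_1 = 0.6): querying with k_0
# returns v_0 plus 0.6 * v_1, i.e. the second memory leaks in.
overlap_keys = [[1.0, 0.0], [0.6, 0.8]]
leaky = recall(outer_sum_memory(overlap_keys, values), overlap_keys[0])
```

In general, querying with key k_j returns v_j scaled by the squared norm of k_j, plus a crosstalk term weighted by the inner products between k_j and the other keys. Whether learned rank-1 keys stay close enough to orthogonal as atoms accumulate is an empirical question for the paper.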

What would settle it

Observing significant increases in forgetting or routing confusion when accumulating a large number of rank-1 adapters on a sequence of tasks would indicate the approach does not resolve the issues of coarser methods.

Figures

Figures reproduced from arXiv: 2506.21035 by Chongyang Zhao, Dong Gong, Haodong Lu, Kristen Moore, Lina Yao, Minhui Xue.

**Figure 2.** Overview of MoRA. For each new task, we freeze the ranks learned on previous tasks …

**Figure 3.** Visualization of MoRA rank activations during Task 1 and Task 2 training. Activations are …

**Figure 4.** Ablation on (a) rank activation budget, (b) temperature …

**Figure 5.** Extended view of Fig. 3 illustrating …

**Figure 6.** Extended view of Fig. 3 illustrating …

**Figure 7.** Statistical analyses on the number of ranks required to capture 99% of cumulative sum …

**Figure 8.** Statistical analyses on the number of ranks required to capture 99% of cumulative sum …

**Figure 9.** Required ranks to capture 99% of cumulative activations, shown across different pre-trained …
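Figures 7 to 9 report how many ranks are needed to capture 99% of cumulative activations. One plausible way to compute such a statistic is sketched below; the function name and the input (a list of per-rank activation magnitudes) are assumptions, and the paper's exact procedure may differ.

```python
def ranks_to_capture(activations, fraction=0.99):
    """Smallest number of ranks whose activation mass reaches `fraction` of the total."""
    # Sort magnitudes from largest to smallest so the heaviest ranks count first.
    mass = sorted((abs(a) for a in activations), reverse=True)
    total = sum(mass)
    if total == 0:
        return 0
    running = 0.0
    for count, m in enumerate(mass, start=1):
        running += m
        if running >= fraction * total:
            return count
    return len(mass)
```

A small count relative to the number of available ranks would indicate that activations concentrate on a few atoms, which is the sparsity pattern a fine-grained expert design relies on.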

Original abstract

Continual learning (CL) with large pre-trained models aims to incrementally acquire knowledge without catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods expand capacity by adding isolated new experts while freezing old ones, but still suffer from redundancy, interference, routing ambiguity, and consequent forgetting. We investigate the issues stemming from coarse-grained expert granularity. Coarse-grained experts (e.g., high-rank LoRA) encode low-specialty information, leading to expert duplication/interference and routing degradation/confusion as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices act as linear associative memories, MoRAM achieves CL as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory. Extensive experiments on CLIP and LLMs show that MoRAM significantly outperforms state-of-the-art methods, achieving a better plasticity-stability trade-off, stronger generalization, and reduced forgetting. Project Page: https://artificer-ai-lab.github.io/MoRAM/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MoRAM reframes rank-1 LoRA adapters as self-activating associative memories to enable finer-grained continual learning, but the scaling behavior of that self-activation is not yet convincing.

The main point: this paper proposes using rank-1 LoRA adapters as self-activating associative memory experts for continual learning. Capacity is added in small increments, which is meant to sidestep the routing problems that come with larger experts. The authors ground this in the view that weight matrices act as linear associative memories: each rank-1 adapter becomes a key-value pair that evaluates its relevance through its own intrinsic key. This removes the explicit router and turns inference into content-addressable retrieval over the growing set of memories. The argument is that finer-grained experts cut down on redundancy and interference, and so reduce forgetting.

This reframing is new in the context of LoRA-MoE continual learning and gives a direct way to handle incremental addition. It does a good job of highlighting why coarse experts cause issues as they accumulate.

The soft spot is the self-activation. The stress-test note is on target: without a precise way to compute relevance that stays stable as more atoms are added, key collisions or overlapping activations could still cause problems. The abstract claims strong results on CLIP and LLMs, but the summary gives no specific numbers or ablation details, so it is hard to confirm that the plasticity-stability gains are real and not due to other factors.

This work is for people focused on continual learning with large models using efficient adaptation methods; a reader looking for new modeling ideas in MoE would also find it useful. It engages honestly with the literature on the granularity problem, so it deserves a serious referee who will examine the implementation and results closely. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes MoRAM, a continual learning method for large pre-trained models that incrementally adds rank-1 adapters interpreted as fine-grained associative memory experts. By viewing these adapters as key-value pairs from a linear associative memory perspective, the approach enables self-activation without explicit MoE routers, aiming to reduce redundancy, interference, and forgetting while improving the plasticity-stability trade-off. Experiments on CLIP and LLMs are claimed to show outperformance over state-of-the-art methods.

Significance. If the experimental results hold and the self-activation mechanism scales without interference, the work could provide a principled way to achieve finer-grained expert specialization in continual learning, potentially leading to more efficient capacity expansion than coarser LoRA-MoE approaches. The associative memory framing offers an interesting conceptual link between weight matrices and content-addressable retrieval.

major comments (2)

[§3] §3 (Method): The self-activation process, where each rank-1 adapter evaluates relevance via its intrinsic key for content-addressable retrieval, is central to eliminating routers and avoiding interference. However, the precise computation of activation scores and the mechanism ensuring robustness against key collisions or overlap as the number of incremental tasks grows are not formally defined or analyzed, which undermines the claim that this yields reduced forgetting.
[§4] §4 (Experiments): The abstract and introduction assert significant outperformance and better plasticity-stability trade-off, but without specific quantitative metrics, ablation studies on rank-1 granularity vs. higher-rank experts, or analysis of failure modes (e.g., activation overlap with increasing task count), it is impossible to verify whether the claimed gains are supported or if they depend on particular hyperparameter choices.

minor comments (2)

[§3.1] Notation for the key-value decomposition of rank-1 adapters should be clarified with an explicit equation showing how the intrinsic key is extracted and used for relevance scoring.
[§4] The paper should include a table comparing parameter counts and inference overhead against baselines to substantiate efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our work. We address each of the major comments point by point below, indicating the revisions we plan to make to strengthen the manuscript.

Referee: [§3] §3 (Method): The self-activation process, where each rank-1 adapter evaluates relevance via its intrinsic key for content-addressable retrieval, is central to eliminating routers and avoiding interference. However, the precise computation of activation scores and the mechanism ensuring robustness against key collisions or overlap as the number of incremental tasks grows are not formally defined or analyzed, which undermines the claim that this yields reduced forgetting.

Authors: We thank the referee for highlighting this important aspect. In Section 3, we define the self-activation mechanism where each rank-1 adapter (W = uv^T) uses its key vector u as the intrinsic key for computing activation scores via the dot product with the input embedding, normalized by the norm to produce relevance scores. This enables content-addressable retrieval without an external router. Regarding robustness to key collisions, while we provide empirical evidence through experiments showing low interference, we agree that a more formal analysis would strengthen the paper. We will add a subsection in the revised version providing bounds on activation overlap and discussing regularization techniques used to mitigate collisions as the number of tasks increases. revision: yes
Referee: [§4] §4 (Experiments): The abstract and introduction assert significant outperformance and better plasticity-stability trade-off, but without specific quantitative metrics, ablation studies on rank-1 granularity vs. higher-rank experts, or analysis of failure modes (e.g., activation overlap with increasing task count), it is impossible to verify whether the claimed gains are supported or if they depend on particular hyperparameter choices.

Authors: We appreciate this feedback on the presentation of results. The experimental section includes specific quantitative metrics in Tables 1, 2, and 3, reporting metrics such as average accuracy, backward transfer (forgetting), and forward transfer for both CLIP and LLM benchmarks. We have included ablations comparing rank-1 experts to higher-rank variants (e.g., rank-4 and rank-8), demonstrating that finer granularity reduces redundancy and improves the plasticity-stability trade-off. Failure modes, including potential activation overlap, are analyzed in Section 4.5 with visualizations of expert activation patterns across tasks. To address the referee's concern directly, we will revise the abstract and introduction to reference these specific results more explicitly and expand the ablation studies in the main text. revision: partial
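For readers unfamiliar with the metrics named in this response, the sketch below computes average accuracy and backward transfer from an accuracy matrix, using definitions that are common in the continual learning literature. The function names and the matrix layout are assumptions for illustration; the paper's tables may use different variants.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[t][i] is the accuracy on task i measured after training on task t
    (assumed layout, zero-indexed, square T x T matrix).
    """
    final = acc[-1]
    return sum(final) / len(final)

def backward_transfer(acc):
    """Average change on earlier tasks between when each was learned and the end.

    Negative values indicate forgetting; zero means earlier tasks were retained.
    """
    T = len(acc)
    if T < 2:
        return 0.0
    return sum(acc[-1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)
```

Reporting both numbers matters for the plasticity-stability claim: a method can score a high average accuracy while still showing strongly negative backward transfer on the earliest tasks.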

Circularity Check

0 steps flagged

No significant circularity; central claim is a modeling choice grounded in external interpretation

The paper introduces MoRAM via the modeling assumption that weight matrices act as linear associative memories, allowing rank-1 adapters to serve as self-activating key-value memory atoms. This is presented as a foundational view enabling incremental addition and router-free inference, not as a quantity derived from the paper's own fitted parameters or equations. No load-bearing step reduces a prediction to an input by construction, and no self-citation chain is invoked to justify uniqueness or force the architecture. The derivation remains self-contained against the stated associative-memory perspective.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that linear weight matrices function as associative memories and that rank-1 decomposition yields sufficiently independent memory atoms.

free parameters (1)

rank-1 adapter dimension
The choice of rank exactly 1 is a modeling decision that controls granularity and is not derived from first principles.
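
The granularity choice can be read off a standard identity: any rank-r LoRA update is a sum of r rank-1 terms, so splitting it into rank-1 experts leaves the parameter count unchanged and only alters how the terms are gated. The symbols below (B, A, b_i, a_i, d_in, d_out) are notation assumed for this note, not taken from the paper.

```latex
\[
\Delta W \;=\; BA \;=\; \sum_{i=1}^{r} b_i a_i^{\top},
\qquad
B = [\, b_1 \cdots b_r \,] \in \mathbb{R}^{d_{\text{out}} \times r},
\quad
A = [\, a_1 \cdots a_r \,]^{\top} \in \mathbb{R}^{r \times d_{\text{in}}}
\]
\[
\Delta W\, x \;=\; \sum_{i=1}^{r} \bigl(a_i^{\top} x\bigr)\, b_i,
\qquad
\#\text{params} \;=\; r\,\bigl(d_{\text{in}} + d_{\text{out}}\bigr)
\]
```

A coarse expert applies one gate to the whole sum, while a rank-1 design can gate each term separately, which is where the finer granularity comes from.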

axioms (1)

domain assumption: Weight matrices act as linear associative memories
Invoked in the abstract as the foundational view enabling the key-value memory interpretation.

invented entities (1)

Rank-1 associative memory expert (no independent evidence)
purpose: Fine-grained, reusable memory atom that self-activates via intrinsic key
New conceptual unit introduced to replace coarse experts; no external falsifiable prediction supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

each rank-1 update is analogous to an independent expert... wi = softmax(s / τMoRA)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 14 internal anchors

[1]

Intrinsic dimen- sionality explains the effectiveness of language model fine-tuning

A. Aghajanyan, L. Zettlemoyer, and S. Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020

work page arXiv 2012
[2]

Aljundi, F

R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018

work page 2018
[3]

Aljundi, K

R. Aljundi, K. Kelchtermans, and T. Tuytelaars. Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019

work page 2019
[4]

Aljundi, M

R. Aljundi, M. Lin, B. Goujaud, and Y . Bengio. Gradient based sample selection for online continual learning. Advances in neural information processing systems, 32, 2019

work page 2019
[5]

J. A. Anderson. A simple neural network generating an interactive memory. Mathematical biosciences, 14(3-4):197–220, 1972

work page 1972
[6]

D. Bau, S. Liu, T. Wang, J.-Y . Zhu, and A. Torralba. Rewriting a deep generative model. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 351–369. Springer, 2020. 10

work page 2020
[7]

Biderman, J

D. Biderman, J. Portes, J. J. G. Ortiz, M. Paul, P. Greengard, C. Jennings, D. King, S. Havens, V . Chiley, J. Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

work page arXiv 2024
[8]

Bossard, M

L. Bossard, M. Guillaumin, and L. Van Gool. Food-101–mining discriminative components with random forests. In Proceedings of the European conference on computer vision (ECCV), pages 446–461, 2014

work page 2014
[9]

Chaudhary

S. Chaudhary. Code alpaca: An instruction-following llama model for code generation. https: //github.com/sahil280114/codealpaca, 2023

work page 2023
[10]

Chaudhry, P

A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learn- ing: Understanding forgetting and intransigence. In Proceedings of the European conference on computer vision (ECCV), pages 532–547, 2018

work page 2018
[11]

Efficient Lifelong Learning with A-GEM

A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

S. Chen, Z. Jie, and L. Ma. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv preprint arXiv:2401.16160, 2024

work page arXiv 2024
[14]

Z. Chen, Z. Wang, Z. Wang, H. Liu, Z. Yin, S. Liu, L. Sheng, W. Ouyang, Y . Qiao, and J. Shao. Octavius: Mitigating task interference in mllms via lora-moe. arXiv preprint arXiv:2311.02684, 2023

work page arXiv 2023
[15]

Cimpoi, S

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

work page 2014
[16]

D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

De Lange, R

M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

work page 2021
[18]

de Masson D’Autume, S

C. de Masson D’Autume, S. Ruder, L. Kong, and D. Yogatama. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[19]

L. Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012

work page 2012
[20]

N. Ding, X. Lv, Q. Wang, Y . Chen, B. Zhou, Z. Liu, and M. Sun. Sparse low-rank adaptation of pre-trained language models. arXiv preprint arXiv:2311.11696, 2023

work page arXiv 2023
[21]

Y . Ding, L. Liu, C. Tian, J. Yang, and H. Ding. Don’t stop learning: Towards continual learning for the clip model. arXiv preprint arXiv:2207.09248, 2022

work page arXiv 2022
[22]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[23]

S. Dou, E. Zhou, Y . Liu, S. Gao, J. Zhao, W. Shen, Y . Zhou, Z. Xi, X. Wang, X. Fan, et al. Loramoe: Alleviate world knowledge forgetting in large language models via moe-style plugin. arXiv preprint arXiv:2312.09979, 2023

work page arXiv 2023
[24]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. 11

work page 2022
[25]

Fei-Fei, R

L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training ex- amples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

work page 2004
[26]

L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness, 07 2024. URL https://zenodo.org/records/ 12608602

work page 2024
[27]

S. Garg, M. Farajtabar, H. Pouransari, R. Vemulapalli, S. Mehta, O. Tuzel, V . Shankar, and F. Faghri. Tic-clip: Continual training of clip models. arXiv preprint arXiv:2310.16226, 2023

work page arXiv 2023
[28]

Grattafiori, A

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

work page 2024
[29]

Hadsell, D

R. Hadsell, D. Rao, A. A. Rusu, and R. Pascanu. Embracing change: Continual learning in deep neural networks. Trends in cognitive sciences, 24(12):1028–1040, 2020

work page 2020
[30]

Helber, B

P. Helber, B. Bischke, A. Dengel, and D. Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

work page 2019
[31]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[32]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

doi:10.5281/zenodo.5143773 , url =

G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V . Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773

work page doi:10.5281/zenodo.5143773 2021
[34]

S. Jha, D. Gong, H. Zhao, and L. Yao. Npcl: Neural processes for uncertainty-aware continual learning. arXiv preprint arXiv:2310.19272, 2023

work page arXiv 2023
[35]

S. Jha, D. Gong, and L. Yao. CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=rF1YRtZfoJ

work page 2024
[36]

Kirkpatrick, R

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017
[37]

T. Kohonen. Correlation matrix memories. IEEE transactions on computers, 100(4):353–359, 1972

work page 1972
[38]

Krause, M

J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

work page 2013
[39]

Krizhevsky, G

A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009

work page 2009
[40]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[41]

C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[42]

D. Li, Y . Ma, N. Wang, Z. Ye, Z. Cheng, Y . Tang, Y . Zhang, L. Duan, J. Zuo, C. Yang, et al. Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts. arXiv preprint arXiv:2404.15159, 2024. 12

work page arXiv 2024
[43]

Li and D

Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

work page 2017
[44]

Liang and W.-J

Y .-S. Liang and W.-J. Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 23638–23647, 2024

work page 2024
[45]

Y . Liu, Y . Su, A.-A. Liu, B. Schiele, and Q. Sun. Mnemonics training: Multi-class incremental learning without forgetting. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 12245–12254, 2020

work page 2020
[46]

Z. Liu, J. Lyn, W. Zhu, X. Tian, and Y . Graham. Alora: Allocating low-rank adaptation for fine-tuning large language models. arXiv preprint arXiv:2403.16187, 2024

work page arXiv 2024
[47]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

H. Lu, C. Zhao, J. Xue, L. Yao, K. Moore, and D. Gong. Adaptive rank, reduced forgetting: Knowledge retention in continual learning vision-language models with dynamic rank-selective lora. arXiv preprint arXiv:2412.01004, 2024

work page arXiv 2024
[49]

Z. Luo, Y . Liu, B. Schiele, and Q. Sun. Class-incremental exemplar compression for class- incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11371–11380, 2023

work page 2023
[50]

S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[51]

McCloskey and N

M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation , volume 24, pages 109–165. Elsevier, 1989

work page 1989
[52]

M. D. McDonnell, D. Gong, A. Parvaneh, E. Abbasnejad, and A. van den Hengel. Ranpac: Ran- dom projections and pre-trained models for continual learning. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[53]

F. Meng, Z. Wang, and M. Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024

work page arXiv 2024
[54]

K. Meng, D. Bau, A. Andonian, and Y . Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022
[55]

C. V . Nguyen, A. Achille, M. Lam, T. Hassner, V . Mahadevan, and S. Soatto. Toward un- derstanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091, 2019

work page arXiv 1908
[56]

Nilsback and A

M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing , pages 722–729. IEEE, 2008

work page 2008
[57]

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

work page 2012
[58]

Qiao and M

F. Qiao and M. Mahdavi. Learn more, but bother less: parameter efficient continual learning. Advances in Neural Information Processing Systems, 37:97476–97498, 2024

work page 2024
[59]

Qin and S

C. Qin and S. Joty. Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. arXiv preprint arXiv:2110.07298, 2021

work page arXiv 2021
[60]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 13

work page 2021
[61]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[62]

Razdaibiedina, Y

A. Razdaibiedina, Y . Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314, 2023

work page arXiv 2023
[63]

Rebuffi, A

S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

work page 2001
[64]

A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[65]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outra- geously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[66]

J. S. Smith, L. Karlinsky, V . Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, 2023

work page 2023
[67]

L. Tang, Z. Tian, K. Li, C. He, H. Zhou, H. Zhao, X. Li, and J. Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models. In European Conference on Computer Vision, pages 346–365. Springer, 2025

[68]

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

[69]

A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

[70]

F.-Y. Wang, D.-W. Zhou, L. Liu, H.-J. Ye, Y. Bian, D.-C. Zhan, and P. Zhao. Beef: Bi-compatible class-incremental learning via energy-based expansion and fusion. In The Eleventh International Conference on Learning Representations, 2022

[71]

F.-Y. Wang, D.-W. Zhou, H.-J. Ye, and D.-C. Zhan. Foster: Feature boosting and compression for class-incremental learning. In European conference on computer vision, pages 398–414. Springer, 2022

[72]

H. Wang, H. Lu, L. Yao, and D. Gong. Self-expansion of pre-trained models with mixture of adapters for continual learning. arXiv preprint arXiv:2403.18886, 2024

[73]

L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

[74]

X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang. Orthogonal subspace learning for language model continual learning. arXiv preprint arXiv:2310.14152, 2023

[75]

Y. Wang, Z. Huang, and X. Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. Advances in Neural Information Processing Systems, 35:5682–5695, 2022

[76]

Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631–648. Springer, 2022

[77]

Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022

[78]

M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022

[79]

T. Wu, J. Wang, Z. Zhao, and N. Wong. Mixture-of-subspaces in low-rank adaptation. arXiv preprint arXiv:2406.11909, 2024

[80]

J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

