pith. sign in

arxiv: 2506.21035 · v5 · pith:4A4WGOC5new · submitted 2025-06-26 · 💻 cs.LG

Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

Pith reviewed 2026-05-22 13:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learningmixture of expertsLoRArank-1 adaptersassociative memorycatastrophic forgettingparameter-efficient fine-tuninglarge language models
0
0 comments X

The pith

Rank-1 adapters act as self-activating associative memories for continual learning without explicit routers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that continual learning in large pre-trained models can be accomplished by incrementally adding atomic rank-1 adapters that function as fine-grained experts and associative memory units. It identifies problems with coarser experts in existing LoRA-MoE methods, including redundancy, interference, and routing degradation. By grounding the approach in weight matrices as linear associative memories, each rank-1 adapter is treated as a key-value pair that self-evaluates relevance for activation. This turns the process into content-addressable retrieval over accumulated memory, leading to better plasticity-stability balance and less forgetting as shown in experiments with CLIP and large language models.

Core claim

MoRAM achieves continual learning as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory.

What carries the argument

Mixture of Rank-1 Associative Memory (MoRAM) where rank-1 adapters serve as independent key-value memory pairs that self-activate for incremental capacity expansion in continual learning.

Load-bearing premise

The assumption that weight matrices function as linear associative memories, allowing rank-1 adapters to operate as independent memory atoms without causing redundancy or interference.

What would settle it

Observing significant increases in forgetting or routing confusion when accumulating a large number of rank-1 adapters on a sequence of tasks would indicate the approach does not resolve the issues of coarser methods.

Figures

Figures reproduced from arXiv: 2506.21035 by Chongyang Zhao, Dong Gong, Haodong Lu, Kristen Moore, Lina Yao, Minhui Xue.

Figure 1
Figure 1. Figure 1: Conceptional illustration of CL with (a) LoRA, (b) MoE-LoRA, and (c) MoRA (Ours). [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of MoRA. For each new task, we freeze the ranks learned on previous tasks [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of MoRA rank activations during Task 1 and Task 2 training. Activations are [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation on (a) rank activation budget, (b) temperature [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Extended view of Fig. 3 illustrating [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Extended view of Fig. 3 illustrating [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Statistical analyses on the number of ranks required to capture 99% of cumulative sum [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Statistical analyses on the number of ranks required to capture 99% of cumulative sum [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Required ranks to capture 99 % of cumulative activations, shown across different pre-trained [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
read the original abstract

Continual learning (CL) with large pre-trained models aims to incrementally acquire knowledge without catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods expand capacity by adding isolated new experts while freezing old ones, but still suffer from redundancy, interference, routing ambiguity, and consequent forgetting. We investigate the issues stemming from coarse-grained expert granularity. Coarse-grained experts (e.g., high-rank LoRA) encode low-specialty information, leading to expert duplication/interference and routing degradation/confusion as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices act as linear associative memories, MoRAM achieves CL as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory. Extensive experiments on CLIP and LLMs show that MoRAM significantly outperforms state-of-the-art methods, achieving a better plasticity-stability trade-off, stronger generalization, and reduced forgetting. Project Page: https://artificer-ai-lab.github.io/MoRAM/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MoRAM, a continual learning method for large pre-trained models that incrementally adds rank-1 adapters interpreted as fine-grained associative memory experts. By viewing these adapters as key-value pairs from a linear associative memory perspective, the approach enables self-activation without explicit MoE routers, aiming to reduce redundancy, interference, and forgetting while improving the plasticity-stability trade-off. Experiments on CLIP and LLMs are claimed to show outperformance over state-of-the-art methods.

Significance. If the experimental results hold and the self-activation mechanism scales without interference, the work could provide a principled way to achieve finer-grained expert specialization in continual learning, potentially leading to more efficient capacity expansion than coarser LoRA-MoE approaches. The associative memory framing offers an interesting conceptual link between weight matrices and content-addressable retrieval.

major comments (2)
  1. [§3] §3 (Method): The self-activation process, where each rank-1 adapter evaluates relevance via its intrinsic key for content-addressable retrieval, is central to eliminating routers and avoiding interference. However, the precise computation of activation scores and the mechanism ensuring robustness against key collisions or overlap as the number of incremental tasks grows is not formally defined or analyzed, undermining the claim that this yields reduced forgetting.
  2. [§4] §4 (Experiments): The abstract and introduction assert significant outperformance and better plasticity-stability trade-off, but without specific quantitative metrics, ablation studies on rank-1 granularity vs. higher-rank experts, or analysis of failure modes (e.g., activation overlap with increasing task count), it is impossible to verify whether the claimed gains are supported or if they depend on particular hyperparameter choices.
minor comments (2)
  1. [§3.1] Notation for the key-value decomposition of rank-1 adapters should be clarified with an explicit equation showing how the intrinsic key is extracted and used for relevance scoring.
  2. [§4] The paper should include a table comparing parameter counts and inference overhead against baselines to substantiate efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our work. We address each of the major comments point by point below, indicating the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The self-activation process, where each rank-1 adapter evaluates relevance via its intrinsic key for content-addressable retrieval, is central to eliminating routers and avoiding interference. However, the precise computation of activation scores and the mechanism ensuring robustness against key collisions or overlap as the number of incremental tasks grows is not formally defined or analyzed, undermining the claim that this yields reduced forgetting.

    Authors: We thank the referee for highlighting this important aspect. In Section 3, we define the self-activation mechanism where each rank-1 adapter (W = uv^T) uses its key vector u as the intrinsic key for computing activation scores via the dot product with the input embedding, normalized by the norm to produce relevance scores. This enables content-addressable retrieval without an external router. Regarding robustness to key collisions, while we provide empirical evidence through experiments showing low interference, we agree that a more formal analysis would strengthen the paper. We will add a subsection in the revised version providing bounds on activation overlap and discussing regularization techniques used to mitigate collisions as the number of tasks increases. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and introduction assert significant outperformance and better plasticity-stability trade-off, but without specific quantitative metrics, ablation studies on rank-1 granularity vs. higher-rank experts, or analysis of failure modes (e.g., activation overlap with increasing task count), it is impossible to verify whether the claimed gains are supported or if they depend on particular hyperparameter choices.

    Authors: We appreciate this feedback on the presentation of results. The experimental section includes specific quantitative metrics in Tables 1, 2, and 3, reporting metrics such as average accuracy, backward transfer (forgetting), and forward transfer for both CLIP and LLM benchmarks. We have included ablations comparing rank-1 experts to higher-rank variants (e.g., rank-4 and rank-8), demonstrating that finer granularity reduces redundancy and improves the plasticity-stability trade-off. Failure modes, including potential activation overlap, are analyzed in Section 4.5 with visualizations of expert activation patterns across tasks. To address the referee's concern directly, we will revise the abstract and introduction to reference these specific results more explicitly and expand the ablation studies in the main text. revision: partial

Circularity Check

0 steps flagged

No significant circularity; central claim is a modeling choice grounded in external interpretation

full rationale

The paper introduces MoRAM via the modeling assumption that weight matrices act as linear associative memories, allowing rank-1 adapters to serve as self-activating key-value memory atoms. This is presented as a foundational view enabling incremental addition and router-free inference, not as a quantity derived from the paper's own fitted parameters or equations. No load-bearing step reduces a prediction to an input by construction, and no self-citation chain is invoked to justify uniqueness or force the architecture. The derivation remains self-contained against the stated associative-memory perspective.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that linear weight matrices function as associative memories and that rank-1 decomposition yields sufficiently independent memory atoms.

free parameters (1)
  • rank-1 adapter dimension
    The choice of rank exactly 1 is a modeling decision that controls granularity and is not derived from first principles.
axioms (1)
  • domain assumption Weight matrices act as linear associative memories
    Invoked in the abstract as the foundational view enabling the key-value memory interpretation.
invented entities (1)
  • Rank-1 associative memory expert no independent evidence
    purpose: Fine-grained, reusable memory atom that self-activates via intrinsic key
    New conceptual unit introduced to replace coarse experts; no external falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.0 · 5800 in / 1447 out tokens · 52110 ms · 2026-05-22T13:02:34.998231+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 14 internal anchors

  1. [1]

    Intrinsic dimen- sionality explains the effectiveness of language model fine-tuning

    A. Aghajanyan, L. Zettlemoyer, and S. Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020

  2. [2]

    Aljundi, F

    R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018

  3. [3]

    Aljundi, K

    R. Aljundi, K. Kelchtermans, and T. Tuytelaars. Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019

  4. [4]

    Aljundi, M

    R. Aljundi, M. Lin, B. Goujaud, and Y . Bengio. Gradient based sample selection for online continual learning. Advances in neural information processing systems, 32, 2019

  5. [5]

    J. A. Anderson. A simple neural network generating an interactive memory. Mathematical biosciences, 14(3-4):197–220, 1972

  6. [6]

    D. Bau, S. Liu, T. Wang, J.-Y . Zhu, and A. Torralba. Rewriting a deep generative model. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 351–369. Springer, 2020. 10

  7. [7]

    Biderman, J

    D. Biderman, J. Portes, J. J. G. Ortiz, M. Paul, P. Greengard, C. Jennings, D. King, S. Havens, V . Chiley, J. Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

  8. [8]

    Bossard, M

    L. Bossard, M. Guillaumin, and L. Van Gool. Food-101–mining discriminative components with random forests. In Proceedings of the European conference on computer vision (ECCV), pages 446–461, 2014

  9. [9]

    Chaudhary

    S. Chaudhary. Code alpaca: An instruction-following llama model for code generation. https: //github.com/sahil280114/codealpaca, 2023

  10. [10]

    Chaudhry, P

    A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learn- ing: Understanding forgetting and intransigence. In Proceedings of the European conference on computer vision (ECCV), pages 532–547, 2018

  11. [11]

    Efficient Lifelong Learning with A-GEM

    A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018

  12. [12]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  13. [13]

    S. Chen, Z. Jie, and L. Ma. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv preprint arXiv:2401.16160, 2024

  14. [14]

    Z. Chen, Z. Wang, Z. Wang, H. Liu, Z. Yin, S. Liu, L. Sheng, W. Ouyang, Y . Qiao, and J. Shao. Octavius: Mitigating task interference in mllms via lora-moe. arXiv preprint arXiv:2311.02684, 2023

  15. [15]

    Cimpoi, S

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

  16. [16]

    D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024

  17. [17]

    De Lange, R

    M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

  18. [18]

    de Masson D’Autume, S

    C. de Masson D’Autume, S. Ruder, L. Kong, and D. Yogatama. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems, 32, 2019

  19. [19]

    L. Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012

  20. [20]

    N. Ding, X. Lv, Q. Wang, Y . Chen, B. Zhou, Z. Liu, and M. Sun. Sparse low-rank adaptation of pre-trained language models. arXiv preprint arXiv:2311.11696, 2023

  21. [21]

    Y . Ding, L. Liu, C. Tian, J. Yang, and H. Ding. Don’t stop learning: Towards continual learning for the clip model. arXiv preprint arXiv:2207.09248, 2022

  22. [22]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  23. [23]

    S. Dou, E. Zhou, Y . Liu, S. Gao, J. Zhao, W. Shen, Y . Zhou, Z. Xi, X. Wang, X. Fan, et al. Loramoe: Alleviate world knowledge forgetting in large language models via moe-style plugin. arXiv preprint arXiv:2312.09979, 2023

  24. [24]

    Fedus, B

    W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. 11

  25. [25]

    Fei-Fei, R

    L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training ex- amples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

  26. [26]

    L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness, 07 2024. URL https://zenodo.org/records/ 12608602

  27. [27]

    S. Garg, M. Farajtabar, H. Pouransari, R. Vemulapalli, S. Mehta, O. Tuzel, V . Shankar, and F. Faghri. Tic-clip: Continual training of clip models. arXiv preprint arXiv:2310.16226, 2023

  28. [28]

    Grattafiori, A

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

  29. [29]

    Hadsell, D

    R. Hadsell, D. Rao, A. A. Rusu, and R. Pascanu. Embracing change: Continual learning in deep neural networks. Trends in cognitive sciences, 24(12):1028–1040, 2020

  30. [30]

    Helber, B

    P. Helber, B. Bischke, A. Dengel, and D. Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

  31. [31]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  32. [32]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  33. [33]

    doi:10.5281/zenodo.5143773 , url =

    G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V . Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773

  34. [34]

    S. Jha, D. Gong, H. Zhao, and L. Yao. Npcl: Neural processes for uncertainty-aware continual learning. arXiv preprint arXiv:2310.19272, 2023

  35. [35]

    S. Jha, D. Gong, and L. Yao. CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=rF1YRtZfoJ

  36. [36]

    Kirkpatrick, R

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  37. [37]

    T. Kohonen. Correlation matrix memories. IEEE transactions on computers, 100(4):353–359, 1972

  38. [38]

    Krause, M

    J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013

  39. [39]

    Krizhevsky, G

    A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009

  40. [40]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  41. [41]

    C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018

  42. [42]

    D. Li, Y . Ma, N. Wang, Z. Ye, Z. Cheng, Y . Tang, Y . Zhang, L. Duan, J. Zuo, C. Yang, et al. Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts. arXiv preprint arXiv:2404.15159, 2024. 12

  43. [43]

    Li and D

    Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

  44. [44]

    Liang and W.-J

    Y .-S. Liang and W.-J. Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 23638–23647, 2024

  45. [45]

    Y . Liu, Y . Su, A.-A. Liu, B. Schiele, and Q. Sun. Mnemonics training: Multi-class incremental learning without forgetting. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 12245–12254, 2020

  46. [46]

    Z. Liu, J. Lyn, W. Zhu, X. Tian, and Y . Graham. Alora: Allocating low-rank adaptation for fine-tuning large language models. arXiv preprint arXiv:2403.16187, 2024

  47. [47]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  48. [48]

    H. Lu, C. Zhao, J. Xue, L. Yao, K. Moore, and D. Gong. Adaptive rank, reduced forgetting: Knowledge retention in continual learning vision-language models with dynamic rank-selective lora. arXiv preprint arXiv:2412.01004, 2024

  49. [49]

    Z. Luo, Y . Liu, B. Schiele, and Q. Sun. Class-incremental exemplar compression for class- incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11371–11380, 2023

  50. [50]

    S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013

  51. [51]

    McCloskey and N

    M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation , volume 24, pages 109–165. Elsevier, 1989

  52. [52]

    M. D. McDonnell, D. Gong, A. Parvaneh, E. Abbasnejad, and A. van den Hengel. Ranpac: Ran- dom projections and pre-trained models for continual learning. Advances in Neural Information Processing Systems, 36, 2024

  53. [53]

    F. Meng, Z. Wang, and M. Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024

  54. [54]

    K. Meng, D. Bau, A. Andonian, and Y . Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

  55. [55]

    C. V . Nguyen, A. Achille, M. Lam, T. Hassner, V . Mahadevan, and S. Soatto. Toward un- derstanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091, 2019

  56. [56]

    Nilsback and A

    M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing , pages 722–729. IEEE, 2008

  57. [57]

    O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012

  58. [58]

    Qiao and M

    F. Qiao and M. Mahdavi. Learn more, but bother less: parameter efficient continual learning. Advances in Neural Information Processing Systems, 37:97476–97498, 2024

  59. [59]

    Qin and S

    C. Qin and S. Joty. Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. arXiv preprint arXiv:2110.07298, 2021

  60. [60]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 13

  61. [61]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  62. [62]

    Razdaibiedina, Y

    A. Razdaibiedina, Y . Mao, R. Hou, M. Khabsa, M. Lewis, and A. Almahairi. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314, 2023

  63. [63]

    Rebuffi, A

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

  64. [64]

    A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016

  65. [65]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outra- geously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  66. [66]

    J. S. Smith, L. Karlinsky, V . Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, 2023

  67. [67]

    L. Tang, Z. Tian, K. Li, C. He, H. Zhou, H. Zhao, X. Li, and J. Jia. Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models. In European Conference on Computer Vision, pages 346–365. Springer, 2025

  68. [68]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi- task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  69. [69]

    A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bow- man. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

  70. [70]

    Wang, D.-W

    F.-Y . Wang, D.-W. Zhou, L. Liu, H.-J. Ye, Y . Bian, D.-C. Zhan, and P. Zhao. Beef: Bi-compatible class-incremental learning via energy-based expansion and fusion. InThe Eleventh International Conference on Learning Representations, 2022

  71. [71]

    Wang, D.-W

    F.-Y . Wang, D.-W. Zhou, H.-J. Ye, and D.-C. Zhan. Foster: Feature boosting and compression for class-incremental learning. In European conference on computer vision, pages 398–414. Springer, 2022

  72. [72]

    H. Wang, H. Lu, L. Yao, and D. Gong. Self-expansion of pre-trained models with mixture of adapters for continual learning. arXiv preprint arXiv:2403.18886, 2024

  73. [73]

    L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence , 2024

  74. [74]

    X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang. Orthogonal subspace learning for language model continual learning. arXiv preprint arXiv:2310.14152, 2023

  75. [75]

    Y . Wang, Z. Huang, and X. Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning.Advances in Neural Information Processing Systems, 35:5682–5695, 2022

  76. [76]

    Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y . Lee, X. Ren, G. Su, V . Perot, J. Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. InEuropean Conference on Computer Vision, pages 631–648. Springer, 2022. 14

  77. [77]

    Z. Wang, Z. Zhang, C.-Y . Lee, H. Zhang, R. Sun, X. Ren, G. Su, V . Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022

  78. [78]

    Wortsman, G

    M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022

  79. [79]

    T. Wu, J. Wang, Z. Zhao, and N. Wong. Mixture-of-subspaces in low-rank adaptation.arXiv preprint arXiv:2406.11909, 2024

  80. [80]

    J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010

Showing first 80 references.