Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts
Pith reviewed 2026-05-22 13:02 UTC · model grok-4.3
The pith
Rank-1 adapters act as self-activating associative memories for continual learning without explicit routers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoRAM achieves continual learning as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory.
What carries the argument
Mixture of Rank-1 Associative Memory (MoRAM) where rank-1 adapters serve as independent key-value memory pairs that self-activate for incremental capacity expansion in continual learning.
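A minimal sketch of the mechanism described above, not the paper's implementation. It assumes each rank-1 adapter is stored as a (key, value) pair, that relevance is the cosine similarity between the input and the key, and that scores become mixing weights through a temperature softmax. The names `RankOneMemory`, `add_atom`, `weights`, `delta`, and `tau` are hypothetical.

```python
import numpy as np


class RankOneMemory:
    """Hypothetical pool of rank-1 adapters treated as key-value atoms."""

    def __init__(self, d_in, d_out, tau=0.1):
        self.keys = np.empty((0, d_in))     # one key per atom (row)
        self.values = np.empty((0, d_out))  # one value per atom (row)
        self.tau = tau                      # softmax temperature (assumed)

    def add_atom(self, key, value):
        # Incremental expansion: append a new atom, leave old ones untouched.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def weights(self, x):
        # Self-activation: each atom scores the input with its own key.
        if len(self.keys) == 0:
            return np.zeros(0)
        norms = np.linalg.norm(self.keys, axis=1) * np.linalg.norm(x)
        scores = self.keys @ x / np.maximum(norms, 1e-12)
        z = (scores - scores.max()) / self.tau  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def delta(self, x):
        # Adapter output: weighted sum of rank-1 updates value_i * (key_i . x).
        w = self.weights(x)
        return (w * (self.keys @ x)) @ self.values
```

No separate router appears in this sketch: the only parameters that decide which atoms fire are the keys themselves, which is the sense in which retrieval is content-addressable.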
Load-bearing premise
The assumption that weight matrices function as linear associative memories, allowing rank-1 adapters to operate as independent memory atoms without causing redundancy or interference.
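The premise can be made concrete with the classical linear associative memory. The following is a textbook worked equation rather than one taken from the paper, and the symbols $k_i$ (keys) and $v_i$ (values) are notation assumed here.

```latex
W = \sum_{i=1}^{N} v_i k_i^{\top}, \qquad
W k_j = \underbrace{v_j \,\lVert k_j \rVert^2}_{\text{retrieved value}}
      + \underbrace{\sum_{i \neq j} v_i \,(k_i^{\top} k_j)}_{\text{crosstalk}}
```

The crosstalk term vanishes only when the keys are mutually orthogonal, so independence between atoms is a property of the keys rather than of the rank-1 form itself.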
What would settle it
Observing significant increases in forgetting or routing confusion when accumulating a large number of rank-1 adapters on a sequence of tasks would indicate the approach does not resolve the issues of coarser methods.
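A sketch of how such a check could be quantified, assuming an accuracy matrix is recorded during sequential training. The definitions below are the commonly used ones for average accuracy, backward transfer, and forgetting; they may differ in detail from the paper's, and `continual_metrics` is a hypothetical helper name.

```python
import numpy as np


def continual_metrics(acc):
    """acc[i, j]: accuracy on task j after training on tasks 0..i (needs >= 2 tasks)."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    final = acc[-1]
    avg_acc = final.mean()
    # Backward transfer: final accuracy minus accuracy right after learning.
    bwt = np.mean(final[:-1] - np.diag(acc)[:-1])
    # Forgetting: best past accuracy minus final accuracy, per earlier task.
    best_past = np.array([acc[j:T - 1, j].max() for j in range(T - 1)])
    forgetting = np.mean(best_past - final[:-1])
    return avg_acc, bwt, forgetting
```

Tracking these values as the number of accumulated adapters grows would show whether forgetting stays flat or rises with the size of the memory.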
Original abstract
Continual learning (CL) with large pre-trained models aims to incrementally acquire knowledge without catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods expand capacity by adding isolated new experts while freezing old ones, but still suffer from redundancy, interference, routing ambiguity, and consequent forgetting. We investigate the issues stemming from coarse-grained expert granularity. Coarse-grained experts (e.g., high-rank LoRA) encode low-specialty information, leading to expert duplication/interference and routing degradation/confusion as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices act as linear associative memories, MoRAM achieves CL as gradual incrementing of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a robust, content-addressable retrieval over the incrementally accumulated memory. Extensive experiments on CLIP and LLMs show that MoRAM significantly outperforms state-of-the-art methods, achieving a better plasticity-stability trade-off, stronger generalization, and reduced forgetting. Project Page: https://artificer-ai-lab.github.io/MoRAM/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MoRAM, a continual learning method for large pre-trained models that incrementally adds rank-1 adapters interpreted as fine-grained associative memory experts. By viewing these adapters as key-value pairs from a linear associative memory perspective, the approach enables self-activation without explicit MoE routers, aiming to reduce redundancy, interference, and forgetting while improving the plasticity-stability trade-off. The authors report that MoRAM outperforms state-of-the-art methods in experiments on CLIP and LLMs.
Significance. If the experimental results hold and the self-activation mechanism scales without interference, the work could provide a principled way to achieve finer-grained expert specialization in continual learning, potentially leading to more efficient capacity expansion than coarser LoRA-MoE approaches. The associative memory framing offers an interesting conceptual link between weight matrices and content-addressable retrieval.
major comments (2)
- [§3] §3 (Method): The self-activation process, where each rank-1 adapter evaluates relevance via its intrinsic key for content-addressable retrieval, is central to eliminating routers and avoiding interference. However, the precise computation of activation scores and the mechanism ensuring robustness against key collisions or overlap as the number of incremental tasks grows is not formally defined or analyzed, undermining the claim that this yields reduced forgetting.
- [§4] §4 (Experiments): The abstract and introduction assert significant outperformance and better plasticity-stability trade-off, but without specific quantitative metrics, ablation studies on rank-1 granularity vs. higher-rank experts, or analysis of failure modes (e.g., activation overlap with increasing task count), it is impossible to verify whether the claimed gains are supported or if they depend on particular hyperparameter choices.
minor comments (2)
- [§3.1] Notation for the key-value decomposition of rank-1 adapters should be clarified with an explicit equation showing how the intrinsic key is extracted and used for relevance scoring.
- [§4] The paper should include a table comparing parameter counts and inference overhead against baselines to substantiate efficiency claims.
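The parameter-count side of the comparison requested in the last comment can be sketched with simple arithmetic. This assumes the standard LoRA factorization (a `rank x d_in` matrix and a `d_out x rank` matrix) and that each rank-1 adapter stores one key of size `d_in` and one value of size `d_out`; the function names are hypothetical and no figures from the paper are used.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def rank1_pool_params(d_in, d_out, n_atoms):
    """Parameters of n_atoms rank-1 adapters, each one key (d_in) and one value (d_out)."""
    return n_atoms * (d_in + d_out)
```

Under these assumptions a pool of n rank-1 adapters has the same parameter count as a single rank-n LoRA, so any efficiency difference would have to come from how many atoms are active per input and from the cost of scoring the keys.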
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our work. We address each of the major comments point by point below, indicating the revisions we plan to make to strengthen the manuscript.
- Referee: [§3] §3 (Method): The self-activation process, where each rank-1 adapter evaluates relevance via its intrinsic key for content-addressable retrieval, is central to eliminating routers and avoiding interference. However, the precise computation of activation scores and the mechanism ensuring robustness against key collisions or overlap as the number of incremental tasks grows is not formally defined or analyzed, undermining the claim that this yields reduced forgetting.
  Authors: We thank the referee for highlighting this important aspect. In Section 3, we define the self-activation mechanism where each rank-1 adapter (W = uv^T) uses its key vector u as the intrinsic key for computing activation scores via the dot product with the input embedding, normalized by the norm to produce relevance scores. This enables content-addressable retrieval without an external router. Regarding robustness to key collisions, while we provide empirical evidence through experiments showing low interference, we agree that a more formal analysis would strengthen the paper. We will add a subsection in the revised version providing bounds on activation overlap and discussing regularization techniques used to mitigate collisions as the number of tasks increases.
  Revision: yes
- Referee: [§4] §4 (Experiments): The abstract and introduction assert significant outperformance and better plasticity-stability trade-off, but without specific quantitative metrics, ablation studies on rank-1 granularity vs. higher-rank experts, or analysis of failure modes (e.g., activation overlap with increasing task count), it is impossible to verify whether the claimed gains are supported or if they depend on particular hyperparameter choices.
  Authors: We appreciate this feedback on the presentation of results. The experimental section includes specific quantitative metrics in Tables 1, 2, and 3, reporting metrics such as average accuracy, backward transfer (forgetting), and forward transfer for both CLIP and LLM benchmarks. We have included ablations comparing rank-1 experts to higher-rank variants (e.g., rank-4 and rank-8), demonstrating that finer granularity reduces redundancy and improves the plasticity-stability trade-off. Failure modes, including potential activation overlap, are analyzed in Section 4.5 with visualizations of expert activation patterns across tasks. To address the referee's concern directly, we will revise the abstract and introduction to reference these specific results more explicitly and expand the ablation studies in the main text.
  Revision: partial
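One way to probe the activation-overlap concern discussed in this exchange is to track how similar the stored keys are to one another as atoms accumulate. The sketch below is an illustrative diagnostic, not the analysis the authors describe in Section 4.5; it assumes the keys can be collected as rows of a matrix, and `key_overlap` is a hypothetical name.

```python
import numpy as np


def key_overlap(keys):
    """Mean and max pairwise |cosine| between keys (rows); needs >= 2 keys."""
    keys = np.asarray(keys, dtype=float)
    unit = keys / np.maximum(np.linalg.norm(keys, axis=1, keepdims=True), 1e-12)
    sim = np.abs(unit @ unit.T)
    # Drop the diagonal, which is always 1 for nonzero keys.
    off_diag = sim[~np.eye(len(keys), dtype=bool)]
    return off_diag.mean(), off_diag.max()
```

Values near zero indicate nearly orthogonal keys; a maximum that climbs toward one as tasks are added would point to the key collisions the referee asks about.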
Circularity Check
No significant circularity; central claim is a modeling choice grounded in external interpretation
The paper introduces MoRAM via the modeling assumption that weight matrices act as linear associative memories, allowing rank-1 adapters to serve as self-activating key-value memory atoms. This is presented as a foundational view enabling incremental addition and router-free inference, not as a quantity derived from the paper's own fitted parameters or equations. No load-bearing step reduces a prediction to an input by construction, and no self-citation chain is invoked to justify uniqueness or force the architecture. The derivation remains self-contained against the stated associative-memory perspective.
Axiom & Free-Parameter Ledger
free parameters (1)
- rank-1 adapter dimension
axioms (1)
- domain assumption: Weight matrices act as linear associative memories
invented entities (1)
- Rank-1 associative memory expert (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
each rank-1 update is analogous to an independent expert... wi = softmax(s / τMoRA)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.