pith. machine review for the scientific record.

arxiv: 2511.01831 · v3 · submitted 2025-11-03 · 💻 cs.LG · cs.AI

Routing-Based Continual Learning for Multimodal Large Language Models

Pith reviewed 2026-05-18 00:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · multimodal large language models · routing architecture · catastrophic forgetting · expert routing · cross-modal transfer · task relatedness

The pith

A routing-based architecture lets multimodal LLMs add new tasks sequentially without forgetting, matching multi-task performance at fixed cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a routing method for continual learning in multimodal large language models that avoids catastrophic forgetting during sequential task adaptation. Tokens are routed at the input level to specialized experts, preserving core knowledge while incorporating new skills with constant data and compute demands. This contrasts with multi-task learning, whose overhead grows linearly with task count. Tests on models from 2B to 8B parameters show results comparable to joint training, plus cross-modal transfer where one modality's knowledge aids another.

Core claim

Token-level routing assigns inputs to a pool of experts so that multimodal models can integrate new capabilities sequentially while retaining foundational performance, achieving parity with multi-task learning at the efficiency of single-task fine-tuning and enabling cross-modal knowledge sharing.

What carries the argument

Token-level routing mechanism that dynamically assigns each token to the most relevant expert from a growing pool, based on task relatedness, to support specialization without interference.
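The routing idea described above can be sketched in a few lines. This is a minimal, generic top-1 router, not the paper's implementation: the class name `Top1Router`, the dot-product scoring, and the toy experts are all assumptions made for illustration, since this page does not specify how the router scores tokens.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Expert = Callable[[Vector], List[float]]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class Top1Router:
    """Hypothetical top-1 router: one scoring key per expert (an assumption)."""

    def __init__(self) -> None:
        self.keys: List[List[float]] = []
        self.experts: List[Expert] = []

    def add_expert(self, key: Vector, expert: Expert) -> int:
        """Grow the pool by one expert; existing keys and experts are untouched."""
        self.keys.append(list(key))
        self.experts.append(expert)
        return len(self.experts) - 1

    def route(self, token: Vector) -> int:
        """Return the index of the highest-scoring expert (first index on ties)."""
        scores = [dot(token, key) for key in self.keys]
        return max(range(len(scores)), key=scores.__getitem__)

    def forward(self, tokens: Sequence[Vector]) -> List[List[float]]:
        """Run each token through exactly one expert."""
        return [self.experts[self.route(t)](t) for t in tokens]


# Toy usage: two experts with made-up keys and made-up behaviour.
router = Top1Router()
router.add_expert([1, 0], lambda t: [2 * x for x in t])
router.add_expert([0, 1], lambda t: [x + 1 for x in t])
tokens = [[3, 1], [1, 4]]
assignments = [router.route(t) for t in tokens]
outputs = router.forward(tokens)
```

Adding an expert here only appends to the pool, which is the property that lets earlier experts stay frozen; whether the paper's router is trained, frozen, or shared across layers is not stated on this page.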

If this is right

  • Routing stays effective with large expert pools and capitalizes on task similarities.
  • Cross-modal transfer occurs, letting knowledge from one input type improve results in another.
  • Larger models show smaller drops relative to fully specialized fine-tuning.
  • Overall training cost and data use stay fixed no matter how many tasks arrive in sequence.
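The cost contrast in the last bullet can be made concrete with a small calculation. The sketch below rests on simplifying assumptions that are not from the paper: every task contributes the same number of samples, training cost is proportional to samples seen, and multi-task learning retrains jointly on all tasks seen so far each time a task arrives. The function names are hypothetical.

```python
def per_step_cost(tasks_seen: int, samples_per_task: int, strategy: str) -> int:
    """Samples processed when the `tasks_seen`-th task arrives."""
    if strategy == "mtl":
        # Joint retraining touches the data of every task seen so far.
        return tasks_seen * samples_per_task
    if strategy == "routing":
        # Only the new task's data is used to train the new expert.
        return samples_per_task
    raise ValueError(f"unknown strategy: {strategy}")


def cumulative_cost(num_tasks: int, samples_per_task: int, strategy: str) -> int:
    """Total samples processed over a sequence of `num_tasks` arrivals."""
    return sum(
        per_step_cost(k, samples_per_task, strategy)
        for k in range(1, num_tasks + 1)
    )
```

Under these assumptions the per-step cost of joint retraining grows linearly with the number of tasks, so its cumulative cost grows quadratically, while the routing-style cost per step stays flat.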

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing pattern might reduce forgetting in non-multimodal continual learning by exploiting any detectable task overlap.
  • It opens the possibility of maintaining an ever-expanding set of specialized capabilities inside one model without periodic full retraining.
  • Future work could test whether routing decisions themselves can be learned more efficiently on very long task streams.

Load-bearing premise

The router can accurately match tokens to experts according to task relatedness without creating hidden scaling costs or cross-task interference as more tasks and experts are added.

What would settle it

Measure whether performance or compute cost degrades when the task sequence length and expert pool are both doubled while keeping task relatedness low.
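One way to turn this check into a pass/fail criterion is sketched below. The helper names, the per-task score dictionaries, and the tolerance values are all hypothetical; the paper does not define this protocol, and reasonable thresholds would depend on the benchmarks used.

```python
from statistics import mean
from typing import Dict


def mean_drop(baseline: Dict[str, float], stressed: Dict[str, float]) -> float:
    """Average score drop on tasks present in both runs (positive = worse)."""
    shared = sorted(baseline.keys() & stressed.keys())
    if not shared:
        raise ValueError("no shared tasks to compare")
    return mean(baseline[task] - stressed[task] for task in shared)


def degrades(
    baseline: Dict[str, float],
    stressed: Dict[str, float],
    baseline_cost: float,
    stressed_cost: float,
    max_drop: float = 1.0,        # assumed tolerance, in score points
    max_cost_ratio: float = 1.1,  # assumed tolerance on per-step compute
) -> bool:
    """True if doubling tasks and experts hurt accuracy or per-step cost."""
    too_much_forgetting = mean_drop(baseline, stressed) > max_drop
    too_much_compute = stressed_cost / baseline_cost > max_cost_ratio
    return too_much_forgetting or too_much_compute
```

Here `baseline` would hold scores from the original task sequence and `stressed` the scores after doubling both the sequence length and the expert pool, with costs measured per training step so that a fixed-compute method should give a ratio near one.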

Figures

Figures reproduced from arXiv: 2511.01831 by Dimitrios Dimitriadis, Gwang Lee, Jay Mohta, Kenan Emir Ak, Mingwei Shen, Yan Xu.

Figure 1: Comparison of Multi-Task Learning, Routing…
Figure 2: Performance drop in comparison to specialized…
Figure 3: Routing patterns for SNLI (left), MMBENCH (middle), and COCO (right). The figure demonstrates that the rout…
Figure 4: Routing patterns for the MGSM dataset in multilingual transfer. Notably, the model leverages the Chinese expert in…
Original abstract

Multimodal Large Language Models (MLLMs) struggle with continual learning, often suffering from catastrophic forgetting when adapting to sequential tasks. We introduce a routing-based architecture that integrates new capabilities while robustly preserving foundational knowledge. While Multi-Task Learning (MTL) offers a theoretical performance upper bound, it incurs a linearly scaling computational overhead as the number of tasks increases. In contrast, our method maintains fixed data and compute requirements regardless of the task sequence length. Across models ranging from 2B to 8B parameters, we demonstrate that our routing approach performs on par with MTL while retaining the training efficiency of sequential fine-tuning. Beyond merely mitigating forgetting, we observe that token-level routing facilitates cross-modal transfer, leveraging knowledge from one modality to bolster performance in another. Ablation studies confirm the approach's scalability: routing remains robust even with large expert pools and effectively capitalizes on task relatedness. Finally, we show that our method scales favorably, with larger models exhibiting minimal degradation compared to fully specialized fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a routing-based continual learning architecture for multimodal LLMs that uses token-level routing to experts to add new capabilities sequentially while avoiding catastrophic forgetting. It asserts that this achieves performance comparable to multi-task learning (MTL) but with the training efficiency of sequential fine-tuning, keeping data and compute fixed regardless of task number. Evaluations on 2B-8B models show parity with MTL, cross-modal transfer benefits, and robustness in ablations with large expert pools, with favorable scaling for larger models.

Significance. If substantiated, the result would be significant for the field of continual learning in large multimodal models, as it potentially resolves the trade-off between performance (MTL) and efficiency (sequential FT) by using routing to leverage task relatedness and cross-modal transfer without linear overhead. The empirical demonstration across model sizes and ablations on expert pools adds to its practical value, though verification of no hidden costs is key.

major comments (2)
  1. [§4 Experiments] The central claim of performing 'on par with MTL' across model sizes lacks specific metrics, error bars, exact baseline numbers, and dataset details, as noted in the abstract's reporting. This is load-bearing for verifying the performance equivalence.
  2. [§5 Ablations] The ablation studies confirm robustness with large expert pools but do not report measurements of the routing mechanism's compute overhead, results on long task sequences, or modality-specific interference metrics. This leaves the assumption of no hidden scaling costs or interference untested, which is critical for the fixed-compute advantage.
minor comments (2)
  1. [Abstract] The abstract could benefit from including at least one key quantitative result to support the parity claim.
  2. [Notation] Ensure consistent definition of the routing function and expert pool size throughout the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful comments, which help improve the clarity and rigor of our work. Below we respond point-by-point to the major comments.

  1. Referee: §4 Experiments: The central claim of performing 'on par with MTL' across model sizes lacks specific metrics, error bars, exact baseline numbers, and dataset details, as noted in the abstract's reporting. This is load-bearing for verifying the performance equivalence.

    Authors: The manuscript's Section 4 presents comparative results across model sizes in dedicated tables, with performance metrics for our routing method, MTL, and sequential baselines on multimodal tasks. While the abstract condenses these findings, the main text includes the relevant numbers. To fully address this concern, we will incorporate error bars from repeated trials and more explicit dataset specifications directly in the experimental section of the revised manuscript.
    Revision: yes

  2. Referee: §5 Ablations: The ablation studies confirm robustness with large expert pools but do not report measurements of the routing mechanism's compute overhead, results on long task sequences, or modality-specific interference metrics. This leaves the assumption of no hidden scaling costs or interference untested, which is critical for the fixed-compute advantage.

    Authors: Our architecture maintains fixed compute by routing each token to a single expert, independent of the number of tasks, as described in the method section. The ablations in Section 5 demonstrate robustness to large expert pools. We agree that explicit compute overhead measurements, evaluations on longer task sequences, and modality-specific interference metrics would further substantiate the claims. We will include these additional analyses in the revised version where feasible, noting that extending to very long sequences may be constrained by available benchmarks.
    Revision: partial
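The disagreement in the second exchange comes down to which part of the per-token cost depends on the number of experts. The sketch below is a back-of-the-envelope model under assumptions that are not from the paper: a linear router that computes one dot product per expert, and experts of identical size. The function names and the example numbers are hypothetical.

```python
def per_token_cost(d_model: int, num_experts: int, expert_macs: int) -> int:
    """Multiply-accumulates per token under top-1 routing (illustrative only)."""
    router_macs = d_model * num_experts  # one score per expert in the pool
    return router_macs + expert_macs     # only the selected expert runs


def router_share(d_model: int, num_experts: int, expert_macs: int) -> float:
    """Fraction of per-token cost spent on routing decisions."""
    router_macs = d_model * num_experts
    return router_macs / (router_macs + expert_macs)


# Made-up sizes: the expert term is constant, the router term grows with the pool.
small_pool = per_token_cost(d_model=1024, num_experts=8, expert_macs=1_000_000)
large_pool = per_token_cost(d_model=1024, num_experts=64, expert_macs=1_000_000)
```

In this model the expert computation is indeed independent of the task count, as the authors state, while the router term grows linearly with the pool. Whether that term stays negligible in practice is exactly the overhead measurement the referee requests.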

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on external baselines

The paper presents a routing architecture for continual learning in MLLMs and evaluates it via direct experiments against MTL and sequential fine-tuning baselines across 2B-8B models. No equations, predictions, or first-principles results are claimed; performance parity and fixed-compute claims are measured outcomes, not quantities defined by fitted parameters inside the paper. Ablations on expert pool size and task relatedness are reported as empirical checks rather than derivations. No load-bearing self-citations, uniqueness theorems, or smuggled ansatz appear in the abstract or the described content. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the empirical effectiveness of a learned routing mechanism whose behavior is validated through experiments rather than derived from first principles or external benchmarks.

free parameters (1)
  • expert pool size and routing hyperparameters
    Number of experts and routing decision thresholds are chosen or tuned to achieve the reported scaling behavior.
axioms (1)
  • [domain assumption] Task relatedness can be leveraged by token-level routing to improve transfer without interference
    Invoked to explain cross-modal benefits and robustness in ablation studies.
invented entities (1)
  • token-level router for continual learning (no independent evidence)
    purpose: Dynamically directs computation to task-specific experts to prevent forgetting while keeping compute fixed
    Core new component introduced by the paper; no independent falsifiable prediction outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5718 in / 1186 out tokens · 73102 ms · 2026-05-18T00:49:06.686508+00:00 · methodology


