arXiv: 2511.01831 · v3 · submitted 2025-11-03 · cs.LG · cs.AI

Routing-Based Continual Learning for Multimodal Large Language Models

Jay Mohta, Kenan Emir Ak, Gwang Lee, Dimitrios Dimitriadis, Yan Xu, Mingwei Shen

Pith reviewed 2026-05-18 00:49 UTC · model grok-4.3

keywords: continual learning · multimodal large language models · routing architecture · catastrophic forgetting · expert routing · cross-modal transfer · task relatedness

The pith

A routing-based architecture lets multimodal LLMs add new tasks sequentially without forgetting, matching multi-task performance at fixed cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a routing method for continual learning in multimodal large language models that avoids catastrophic forgetting during sequential task adaptation. Tokens are routed at the input level to specialized experts, preserving core knowledge while incorporating new skills with constant data and compute demands. This contrasts with multi-task learning, whose overhead grows linearly with task count. Tests on models from 2B to 8B parameters show results comparable to joint training, plus cross-modal transfer where one modality's knowledge aids another.
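The cost contrast in this paragraph can be made concrete with a toy calculation. The sketch below is not taken from the paper: it assumes every task contributes the same number of training examples and that multi-task learning retrains on the union of all tasks seen so far each time a new task arrives. The function name `training_examples_per_step` and the figure of 10,000 examples per task are hypothetical.

```python
def training_examples_per_step(num_tasks, examples_per_task, strategy):
    """Examples processed when the k-th task arrives, for k = 1..num_tasks.

    Toy cost model (an assumption, not the paper's accounting):
    - "mtl": joint retraining revisits the data of every task seen so far.
    - "routing": only the newly arrived task's data is used.
    """
    if strategy == "mtl":
        return [k * examples_per_task for k in range(1, num_tasks + 1)]
    if strategy == "routing":
        return [examples_per_task] * num_tasks
    raise ValueError(f"unknown strategy: {strategy}")


mtl = training_examples_per_step(4, 10_000, "mtl")
routing = training_examples_per_step(4, 10_000, "routing")
print(mtl)      # [10000, 20000, 30000, 40000]
print(routing)  # [10000, 10000, 10000, 10000]
```

Under these assumptions the per-step cost of joint training grows linearly with the number of tasks, while the sequential cost stays flat, which is the shape of the claim made above.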

Core claim

Token-level routing assigns inputs to a pool of experts so that multimodal models can integrate new capabilities sequentially while retaining foundational performance, achieving parity with multi-task learning at the efficiency of single-task fine-tuning and enabling cross-modal knowledge sharing.

What carries the argument

Token-level routing mechanism that dynamically assigns each token to the most relevant expert from a growing pool, based on task relatedness, to support specialization without interference.
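A minimal sketch of how such a router could score experts for one token, based on the rule quoted from the paper further down this page: each active expert i gets a score from the dot product of its routing vector v_i with the token representation u_t, the scores are divided by √n, and a softmax is taken over the active set E_t. Two points are assumptions rather than statements from the source: n is treated here as the dimension of u_t, and the final top-1 choice follows the rebuttal's remark that each token is routed to a single expert. The expert labels and vectors are hypothetical.

```python
import math


def routing_weights(u_t, routing_vectors, active_experts):
    """Softmax routing weights for a single token.

    u_t: token representation as a list of floats (length n).
    routing_vectors: dict mapping expert id -> routing vector v_i (length n).
    active_experts: ids of the experts in the active set E_t.
    """
    n = len(u_t)  # assumption: n is the dimension of u_t
    scale = math.sqrt(n)
    # alpha_{t,i} = v_i^T u_t, divided by sqrt(n)
    scores = {
        i: sum(v * u for v, u in zip(routing_vectors[i], u_t)) / scale
        for i in active_experts
    }
    # Numerically stable softmax restricted to the active experts
    top = max(scores.values())
    exps = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def top_expert(weights):
    """Pick the single expert with the largest routing weight."""
    return max(weights, key=weights.get)


# Hypothetical two-expert pool with 4-dimensional routing vectors
u = [1.0, 0.0, 0.0, 0.0]
vectors = {"snli": [2.0, 0.0, 0.0, 0.0], "coco": [0.0, 2.0, 0.0, 0.0]}
w = routing_weights(u, vectors, ["snli", "coco"])
chosen = top_expert(w)
```

Note that the scoring loop touches every expert in E_t, so in this sketch the router's own work grows with the size of the active pool even when only one expert is applied to the token.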

If this is right

Routing stays effective with large expert pools and capitalizes on task similarities.
Cross-modal transfer occurs, letting knowledge from one input type improve results in another.
Larger models show smaller drops relative to fully specialized fine-tuning.
Overall training cost and data use stay fixed no matter how many tasks arrive in sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same routing pattern might reduce forgetting in non-multimodal continual learning by exploiting any detectable task overlap.
It opens the possibility of maintaining an ever-expanding set of specialized capabilities inside one model without periodic full retraining.
Future work could test whether routing decisions themselves can be learned more efficiently on very long task streams.

Load-bearing premise

The router can accurately match tokens to experts according to task relatedness without creating hidden scaling costs or cross-task interference as more tasks and experts are added.

What would settle it

Measure whether performance or compute cost degrades when the task sequence length and expert pool are both doubled while keeping task relatedness low.

Figures

Figures reproduced from arXiv: 2511.01831 by Dimitrios Dimitriadis, Gwang Lee, Jay Mohta, Kenan Emir Ak, Mingwei Shen, Yan Xu.

**Figure 2.** Performance drop in comparison to specialized ...

**Figure 3.** Routing patterns for SNLI (left), MMBENCH (middle), and COCO (right). The figure demonstrates that the rout...

**Figure 4.** Routing patterns for the MGSM dataset in multilingual transfer. Notably, the model leverages the Chinese expert in ...

Original abstract

Multimodal Large Language Models (MLLMs) struggle with continual learning, often suffering from catastrophic forgetting when adapting to sequential tasks. We introduce a routing-based architecture that integrates new capabilities while robustly preserving foundational knowledge. While Multi-Task Learning (MTL) offers a theoretical performance upper bound, it incurs a linearly scaling computational overhead as the number of tasks increases. In contrast, our method maintains fixed data and compute requirements regardless of the task sequence length. Across models ranging from 2B to 8B parameters, we demonstrate that our routing approach performs on par with MTL while retaining the training efficiency of sequential fine-tuning. Beyond merely mitigating forgetting, we observe that token-level routing facilitates cross-modal transfer, leveraging knowledge from one modality to bolster performance in another. Ablation studies confirm the approach's scalability: routing remains robust even with large expert pools and effectively capitalizes on task relatedness. Finally, we show that our method scales favorably, with larger models exhibiting minimal degradation compared to fully specialized fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The routing method claims MTL-level performance at fixed compute for sequential MLLM tasks, with some cross-modal transfer, but the abstract leaves the central equivalence and scaling claims under-specified.

The main point is a token-level routing setup that lets MLLMs add new tasks one after another while holding total data and compute steady, unlike MTL which scales linearly with task count. They test this on models from 2B to 8B parameters and report parity with MTL plus better retention than plain sequential fine-tuning. The routing also appears to produce cross-modal transfer, where gains in one modality help another. Ablations are said to show the router stays stable with large expert pools and that bigger models degrade less than fully specialized fine-tuning. That fixed-cost guarantee plus the transfer observation is the concrete extension of prior routing and continual-learning work into the multimodal case.

The experiments cover a reasonable range of model sizes and the baselines are the obvious ones, so the setup is easy to understand.

The soft spots sit in the reporting. The abstract gives no numbers, error bars, exact datasets, or per-task metrics, which makes it difficult to judge how close the parity really is or whether router overhead or interference grows with more tasks and experts. The stress-test concern about hidden scaling costs and degraded routing quality over longer sequences is fair until the full results show direct measurements of compute per task and modality-specific forgetting curves.

This work is aimed at groups building deployable multimodal systems that need to keep learning without repeated full retraining. A reader already working on routing or efficient adaptation would pick up usable ideas on how to keep costs flat while preserving performance. It deserves peer review because the idea is testable, the model-size sweep is there, and the claims can be checked with standard metrics once the details are filled in.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a routing-based continual learning architecture for multimodal LLMs that uses token-level routing to experts to add new capabilities sequentially while avoiding catastrophic forgetting. It asserts that this achieves performance comparable to multi-task learning (MTL) but with the training efficiency of sequential fine-tuning, keeping data and compute fixed regardless of task number. Evaluations on 2B-8B models show parity with MTL, cross-modal transfer benefits, and robustness in ablations with large expert pools, with favorable scaling for larger models.

Significance. If substantiated, the result would be significant for the field of continual learning in large multimodal models, as it potentially resolves the trade-off between performance (MTL) and efficiency (sequential FT) by using routing to leverage task relatedness and cross-modal transfer without linear overhead. The empirical demonstration across model sizes and ablations on expert pools adds to its practical value, though verification of no hidden costs is key.

major comments (2)

[§4 Experiments] The central claim of performing 'on par with MTL' across model sizes lacks specific metrics, error bars, exact baseline numbers, and dataset details, as noted in the abstract's reporting. This is load-bearing for verifying the performance equivalence.
[§5 Ablations] The ablation studies confirm robustness with large expert pools but do not report measurements of the routing mechanism's compute overhead, results on long task sequences, or modality-specific interference metrics. This leaves the assumption of no hidden scaling costs or interference untested, which is critical for the fixed-compute advantage.
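One standard way to quantify the interference these comments ask about is an average-forgetting score computed from a matrix of per-task results recorded after each training step. The sketch below illustrates that general metric from the continual-learning literature; it is not the paper's evaluation protocol, and the matrix `toy` holds made-up numbers chosen only to show the arithmetic.

```python
def average_forgetting(acc):
    """Average drop from each task's best earlier score to its final score.

    acc[i][j] is the score on task j measured after training on task i
    (only entries with j <= i are read). The last task is excluded because
    it has no later training step that could degrade it.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    final = acc[-1]
    drops = []
    for j in range(num_tasks - 1):
        # Best score task j reached at any step before the final one
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


# Hypothetical scores for a three-task sequence (illustrative values only)
toy = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.65, 0.72, 0.90],
]
result = average_forgetting(toy)  # roughly 0.09 for this toy matrix
```

Computing the same score separately over the tasks of each modality would give one possible form of the modality-specific measurement requested above.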

minor comments (2)

[Abstract] The abstract could benefit from including at least one key quantitative result to support the parity claim.
[Notation] Ensure consistent definition of the routing function and expert pool size throughout the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful comments, which help improve the clarity and rigor of our work. Below we respond point-by-point to the major comments.

Referee: §4 Experiments: The central claim of performing 'on par with MTL' across model sizes lacks specific metrics, error bars, exact baseline numbers, and dataset details, as noted in the abstract's reporting. This is load-bearing for verifying the performance equivalence.

Authors: The manuscript's Section 4 presents comparative results across model sizes in dedicated tables, with performance metrics for our routing method, MTL, and sequential baselines on multimodal tasks. While the abstract condenses these findings, the main text includes the relevant numbers. To fully address this concern, we will incorporate error bars from repeated trials and more explicit dataset specifications directly in the experimental section of the revised manuscript.

revision: yes

Referee: §5 Ablations: The ablation studies confirm robustness with large expert pools but do not report measurements of the routing mechanism's compute overhead, results on long task sequences, or modality-specific interference metrics. This leaves the assumption of no hidden scaling costs or interference untested, which is critical for the fixed-compute advantage.

Authors: Our architecture maintains fixed compute by routing each token to a single expert, independent of the number of tasks, as described in the method section. The ablations in Section 5 demonstrate robustness to large expert pools. We agree that explicit compute overhead measurements, evaluations on longer task sequences, and modality-specific interference metrics would further substantiate the claims. We will include these additional analyses in the revised version where feasible, noting that extending to very long sequences may be constrained by available benchmarks.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on external baselines

The paper presents a routing architecture for continual learning in MLLMs and evaluates it via direct experiments against MTL and sequential fine-tuning baselines across 2B-8B models. No equations, predictions, or first-principles results are claimed; performance parity and fixed-compute claims are measured outcomes, not quantities defined by fitted parameters inside the paper. Ablations on expert pool size and task relatedness are reported as empirical checks rather than derivations. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described content. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the empirical effectiveness of a learned routing mechanism whose behavior is validated through experiments rather than derived from first principles or external benchmarks.

free parameters (1)

expert pool size and routing hyperparameters
Number of experts and routing decision thresholds are chosen or tuned to achieve the reported scaling behavior.

axioms (1)

[domain assumption] Task relatedness can be leveraged by token-level routing to improve transfer without interference
Invoked to explain cross-modal benefits and robustness in ablation studies.

invented entities (1)

token-level router for continual learning (no independent evidence)
Purpose: Dynamically directs computation to task-specific experts to prevent forgetting while keeping compute fixed
Core new component introduced by the paper; no independent falsifiable prediction outside the reported experiments is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We implement task-specific LoRA adapters ... routing vectors ... $\alpha_{t,i} = v_i^{\top} u_t$ ... $w_t = \mathrm{softmax}\{\alpha_{t,i}/\sqrt{n} : i \in E_t\}$"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "routing remains robust even with large expert pools and effectively capitalizes on task relatedness"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 16 internal anchors
