Routing-Based Continual Learning for Multimodal Large Language Models
Pith reviewed 2026-05-18 00:49 UTC · model grok-4.3
The pith
A routing-based architecture lets multimodal LLMs add new tasks sequentially without forgetting, matching multi-task performance at fixed cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Token-level routing assigns inputs to a pool of experts so that multimodal models can integrate new capabilities sequentially while retaining foundational performance, achieving parity with multi-task learning at the efficiency of single-task fine-tuning and enabling cross-modal knowledge sharing.
What carries the argument
Token-level routing mechanism that dynamically assigns each token to the most relevant expert from a growing pool, based on task relatedness, to support specialization without interference.
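A minimal sketch of the idea described above, assuming dot-product scoring and hard top-1 selection. The names `Router`, `add_expert`, and `route` are hypothetical, and the paper's actual router may differ:

```python
# Illustrative only: class and method names are hypothetical, and
# dot-product scoring with argmax selection is an assumption.

class Router:
    def __init__(self):
        self.keys = []  # one routing vector per expert, in order of arrival

    def add_expert(self, key):
        """Register a new expert's routing vector; earlier vectors are untouched."""
        self.keys.append(list(key))
        return len(self.keys) - 1

    def route(self, token):
        """Return the index of the expert whose vector scores highest for this token."""
        scores = [sum(k * x for k, x in zip(key, token)) for key in self.keys]
        return max(range(len(scores)), key=scores.__getitem__)


router = Router()
router.add_expert([1.0, 0.0])  # expert 0, added with the first task
router.add_expert([0.0, 1.0])  # expert 1, added when a later task arrives
tokens = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
assignments = [router.route(t) for t in tokens]
print(assignments)  # [0, 1, 0]
```

Adding an expert only appends a vector to the pool, so earlier experts keep their parameters; whether that is enough to prevent interference is exactly what the premise below questions.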
If this is right
- Routing stays effective with large expert pools and capitalizes on task similarities.
- Cross-modal transfer occurs, letting knowledge from one input type improve results in another.
- Larger models show smaller drops relative to fully specialized fine-tuning.
- Overall training cost and data use stay fixed no matter how many tasks arrive in sequence.
Where Pith is reading between the lines
- The same routing pattern might reduce forgetting in non-multimodal continual learning by exploiting any detectable task overlap.
- It opens the possibility of maintaining an ever-expanding set of specialized capabilities inside one model without periodic full retraining.
- Future work could test whether routing decisions themselves can be learned more efficiently on very long task streams.
Load-bearing premise
The router can accurately match tokens to experts according to task relatedness without creating hidden scaling costs or cross-task interference as more tasks and experts are added.
What would settle it
Measure whether performance or compute cost degrades when the task sequence length and expert pool are both doubled while keeping task relatedness low.
Original abstract
Multimodal Large Language Models (MLLMs) struggle with continual learning, often suffering from catastrophic forgetting when adapting to sequential tasks. We introduce a routing-based architecture that integrates new capabilities while robustly preserving foundational knowledge. While Multi-Task Learning (MTL) offers a theoretical performance upper bound, it incurs a linearly scaling computational overhead as the number of tasks increases. In contrast, our method maintains fixed data and compute requirements regardless of the task sequence length. Across models ranging from 2B to 8B parameters, we demonstrate that our routing approach performs on par with MTL while retaining the training efficiency of sequential fine-tuning. Beyond merely mitigating forgetting, we observe that token-level routing facilitates cross-modal transfer, leveraging knowledge from one modality to bolster performance in another. Ablation studies confirm the approach's scalability: routing remains robust even with large expert pools and effectively capitalizes on task relatedness. Finally, we show that our method scales favorably, with larger models exhibiting minimal degradation compared to fully specialized fine-tuning.
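The contrast between linearly growing and fixed training cost can be made concrete with simple arithmetic. The sketch below assumes equal-sized tasks and an MTL baseline that retrains on the union of all tasks seen so far; both are illustrative assumptions rather than details taken from the paper, and `samples_per_stage` is a hypothetical helper:

```python
def samples_per_stage(num_tasks, samples_per_task):
    """Training samples processed at each task arrival under two assumed regimes."""
    # Assumed MTL regime: retrain on every task seen so far.
    mtl = [k * samples_per_task for k in range(1, num_tasks + 1)]
    # Assumed fixed-cost regime: train on the newly arrived task only.
    fixed = [samples_per_task] * num_tasks
    return mtl, fixed


mtl, fixed = samples_per_stage(num_tasks=4, samples_per_task=10_000)
print(mtl)    # [10000, 20000, 30000, 40000]
print(fixed)  # [10000, 10000, 10000, 10000]
```

Under these assumptions the per-stage MTL cost grows linearly with the number of tasks, while the fixed regime processes the same amount of data at every stage.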
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a routing-based continual learning architecture for multimodal LLMs that uses token-level routing to experts to add new capabilities sequentially while avoiding catastrophic forgetting. It asserts that this achieves performance comparable to multi-task learning (MTL) but with the training efficiency of sequential fine-tuning, keeping data and compute fixed regardless of task number. Evaluations on 2B-8B models show parity with MTL, cross-modal transfer benefits, and robustness in ablations with large expert pools, with favorable scaling for larger models.
Significance. If substantiated, the result would be significant for the field of continual learning in large multimodal models, as it potentially resolves the trade-off between performance (MTL) and efficiency (sequential FT) by using routing to leverage task relatedness and cross-modal transfer without linear overhead. The empirical demonstration across model sizes and ablations on expert pools adds to its practical value, though verifying the absence of hidden costs is key.
major comments (2)
- [§4 Experiments] The central claim of performing 'on par with MTL' across model sizes lacks specific metrics, error bars, exact baseline numbers, and dataset details, judging from what the abstract reports. These details are load-bearing for verifying the performance equivalence.
- [§5 Ablations] The ablation studies confirm robustness with large expert pools but do not report measurements of the routing mechanism's compute overhead, results on long task sequences, or modality-specific interference metrics. This leaves untested the assumption of no hidden scaling costs or interference, on which the fixed-compute advantage depends.
minor comments (2)
- [Abstract] The abstract could benefit from including at least one key quantitative result to support the parity claim.
- [Notation] Ensure consistent definition of the routing function and expert pool size throughout the method section.
Simulated Author's Rebuttal
We are grateful to the referee for the thoughtful comments, which help improve the clarity and rigor of our work. Below we respond point-by-point to the major comments.
- Referee: §4 Experiments: The central claim of performing 'on par with MTL' across model sizes lacks specific metrics, error bars, exact baseline numbers, and dataset details, judging from what the abstract reports. These details are load-bearing for verifying the performance equivalence.
  Authors: The manuscript's Section 4 presents comparative results across model sizes in dedicated tables, with performance metrics for our routing method, MTL, and sequential baselines on multimodal tasks. While the abstract condenses these findings, the main text includes the relevant numbers. To fully address this concern, we will incorporate error bars from repeated trials and more explicit dataset specifications directly in the experimental section of the revised manuscript.
  Revision: yes
- Referee: §5 Ablations: The ablation studies confirm robustness with large expert pools but do not report measurements of the routing mechanism's compute overhead, results on long task sequences, or modality-specific interference metrics. This leaves untested the assumption of no hidden scaling costs or interference, on which the fixed-compute advantage depends.
  Authors: Our architecture maintains fixed compute by routing each token to a single expert, independent of the number of tasks, as described in the method section. The ablations in Section 5 demonstrate robustness to large expert pools. We agree that explicit compute overhead measurements, evaluations on longer task sequences, and modality-specific interference metrics would further substantiate the claims. We will include these additional analyses in the revised version where feasible, noting that extending to very long sequences may be constrained by available benchmarks.
  Revision: partial
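The disagreement in the second exchange can be illustrated with a rough per-token cost model. This is a sketch under stated assumptions, not a measurement: `per_token_cost` is a hypothetical helper, scoring is assumed to take one dot product per expert in the pool, and the expert cost of 65,536 multiply-adds corresponds to an assumed rank-32 low-rank adapter at hidden size 1024 (2 × 1024 × 32).

```python
def per_token_cost(num_experts, hidden_dim, expert_flops, top_k=1):
    """Rough multiply-add count for one token at one routed layer (assumed model)."""
    scoring = num_experts * hidden_dim  # one dot product per expert in the pool
    experts = top_k * expert_flops      # only the selected experts are executed
    return scoring, experts


for n in (4, 8, 16):
    scoring, experts = per_token_cost(n, hidden_dim=1024, expert_flops=65_536)
    print(n, scoring, experts)
# 4 4096 65536
# 8 8192 65536
# 16 16384 65536
```

In this model the expert cost stays constant as the pool grows, which matches the authors' argument, while the scoring term grows linearly with pool size, which is the kind of overhead the referee asks to see measured.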
Circularity Check
No circularity: empirical comparisons rest on external baselines
The paper presents a routing architecture for continual learning in MLLMs and evaluates it via direct experiments against MTL and sequential fine-tuning baselines across 2B-8B models. No equations, predictions, or first-principles results are claimed; performance parity and fixed-compute claims are measured outcomes, not quantities defined by fitted parameters inside the paper. Ablations on expert pool size and task relatedness are reported as empirical checks rather than derivations. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described content. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- expert pool size and routing hyperparameters
axioms (1)
- domain assumption: Task relatedness can be leveraged by token-level routing to improve transfer without interference
invented entities (1)
- token-level router for continual learning: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We implement task-specific LoRA adapters ... routing vectors ... $\alpha_{t,i} = v_i^{\top} u_t$ ... $w_t = \operatorname{softmax}\{\alpha_{t,i}/\sqrt{n} : i \in E_t\}$"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "routing remains robust even with large expert pools and effectively capitalizes on task relatedness"
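The routing formula in the first quoted passage of this list (a score from the inner product of a routing vector and a token representation, followed by a softmax scaled by √n over a set of experts) can be sketched as follows. The quoted passage does not define n or the set of experts for token t, so this sketch assumes n is the dimension of the token representation and that the set is simply the experts under consideration for that token; `routing_weights` is a hypothetical name.

```python
import math


def routing_weights(u_t, routing_vectors, active):
    """Softmax over scaled scores v_i . u_t / sqrt(n) for experts i in `active`.

    Assumption: n is the dimension of u_t (not defined in the quoted passage).
    """
    n = len(u_t)
    alpha = {
        i: sum(v * u for v, u in zip(routing_vectors[i], u_t)) / math.sqrt(n)
        for i in active
    }
    m = max(alpha.values())  # subtract the max score for numerical stability
    exp = {i: math.exp(a - m) for i, a in alpha.items()}
    z = sum(exp.values())
    return {i: e / z for i, e in exp.items()}


vectors = {0: [2.0, 0.0, 0.0, 0.0], 1: [0.0, 2.0, 0.0, 0.0], 2: [2.0, 0.0, 0.0, 0.0]}
w = routing_weights([1.0, 0.0, 0.0, 0.0], vectors, active=[0, 1])
print(w)
```

Experts outside the active set receive no weight, and the weights over the active set sum to one.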
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.