Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
Pith reviewed 2026-05-21 23:37 UTC · model grok-4.3
The pith
This survey establishes a taxonomy of four paradigms to address unique continual learning challenges in vision-language models and multimodal large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continual learning for VLMs requires going beyond standard forgetting mitigation because of unique issues including cross-modal feature drift, parameter interference from shared architectures, erosion of zero-shot capabilities, and in generative MLLMs an alignment tax that disrupts chain-of-thought reasoning. The survey deconstructs these modes and introduces a challenge-driven taxonomy with four paradigms: multi-modal replay strategies for memory drift, cross-modal regularization for alignment, parameter-efficient adaptation with dynamic routing, and model fusion and decoupling. It further calls for better benchmarks that track both domain changes and ability retention, along with detailed reasoning evaluations.
What carries the argument
The challenge-driven taxonomy that organizes continual learning techniques around the distinct failure modes of cross-modal models.
If this is right
- Adopting the taxonomy will direct efforts toward methods that maintain cross-modal alignments during continual updates.
- Evaluation protocols will evolve to include separate tracking of domain adaptation and ability preservation in benchmarks.
- Research will advance toward compositional zero-shot learning and integration with embodied systems using sensor data.
- Autonomous agentic ecosystems will benefit from models that can update without collapsing their reasoning structures.
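As an illustration of what separate tracking of domain adaptation and ability preservation could look like, here is a minimal sketch. It assumes the standard continual-learning accuracy matrix (accuracy on task j after training stage i) for the domain track and a simple before/after score ratio for the ability track. The function names, the ability labels, and all numbers are hypothetical and are not taken from the survey.

```python
from statistics import mean


def final_average_accuracy(acc):
    """Mean accuracy over all tasks after the last training stage.

    acc[i][j] is the accuracy on task j measured after training stage i.
    Entries with j > i (tasks not yet seen) are ignored by the metrics.
    """
    return mean(acc[-1])


def average_forgetting(acc):
    """Mean drop from each task's best earlier accuracy to its final accuracy."""
    n = len(acc)
    if n < 2:
        return 0.0
    drops = []
    for j in range(n - 1):  # the last task has had no chance to be forgotten
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - acc[-1][j])
    return mean(drops)


def ability_retention(baseline_scores, final_scores):
    """Ratio of final to pre-update score for each general ability."""
    retention = {}
    for name, before in baseline_scores.items():
        if before <= 0:
            raise ValueError(f"baseline score for {name!r} must be positive")
        retention[name] = final_scores[name] / before
    return retention


# Hypothetical numbers, for illustration only.
domain_acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
]
before = {"zero_shot": 0.60, "cot_reasoning": 0.50}
after = {"zero_shot": 0.54, "cot_reasoning": 0.25}

print("domain track, final average accuracy:", round(final_average_accuracy(domain_acc), 3))
print("domain track, average forgetting:", round(average_forgetting(domain_acc), 3))
print("ability track, retention:", ability_retention(before, after))
```

Reporting the two tracks side by side makes it visible when a model looks healthy on the domain track while a general ability such as chain-of-thought reasoning has degraded.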
Where Pith is reading between the lines
- This framework might extend naturally to other multimodal combinations such as audio-language or video-text models.
- Testing the taxonomy's coverage could involve applying it to classify methods in related fields like continual learning for large language models alone.
- Future work could explore whether model fusion approaches offer advantages in resource-constrained environments for deploying updated VLMs.
Load-bearing premise
The identified failure modes of cross-modal feature drift, parameter interference, zero-shot erosion, and alignment tax represent the primary distinctive challenges for VLMs and MLLMs in continual learning, with the four paradigms sufficiently covering existing solutions.
What would settle it
Publication of multiple new continual learning methods for VLMs that do not align with any of the four proposed paradigms or that solve the problems without using cross-modal specific techniques would indicate the taxonomy is incomplete.
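The test described above can be phrased as a small coverage audit: assign each method to one of the four paradigms, or to none, and count what is left over. The sketch below assumes the assignments are made by hand. The method names are placeholders rather than real papers, and `audit_coverage` is a hypothetical helper, not something the survey provides.

```python
PARADIGMS = (
    "Multi-Modal Replay Strategies",
    "Cross-Modal Regularization",
    "Parameter-Efficient Adaptation",
    "Model Fusion and Decoupling",
)


def audit_coverage(assignments):
    """Group methods by paradigm and collect the ones that fit none.

    assignments maps a method name to a paradigm name, or to None when the
    annotator could not place the method in any of the four paradigms.
    Returns (table, unplaced, coverage), where coverage is the fraction of
    methods that were placed.
    """
    table = {paradigm: [] for paradigm in PARADIGMS}
    unplaced = []
    for method, paradigm in sorted(assignments.items()):
        if paradigm in table:
            table[paradigm].append(method)
        else:
            unplaced.append(method)
    if not assignments:
        return table, unplaced, 1.0
    coverage = 1.0 - len(unplaced) / len(assignments)
    return table, unplaced, coverage


# Placeholder method names, for illustration only.
example = {
    "placeholder_replay_method": "Multi-Modal Replay Strategies",
    "placeholder_prompt_method": "Parameter-Efficient Adaptation",
    "placeholder_unplaced_method": None,
}
table, unplaced, coverage = audit_coverage(example)
print("unplaced methods:", unplaced)
print("coverage:", round(coverage, 3))
```

A growing `unplaced` list over successive literature snapshots would be the kind of evidence that indicates the taxonomy is incomplete.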
Original abstract
Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation} utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
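The abstract names Model Fusion as a paradigm without describing a concrete procedure. As a rough illustration only, the sketch below shows one widely used fusion baseline, linear interpolation between a pretrained and a fine-tuned checkpoint. This is a stand-in technique chosen for the example and is not claimed to be a method the survey covers. Checkpoints are represented as plain dictionaries of float lists, and `interpolate_weights` and `alpha` are hypothetical names.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned, per parameter.

    alpha = 0 keeps the pretrained model (and its zero-shot behaviour);
    alpha = 1 keeps the fine-tuned model; values in between trade the two off.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    merged = {}
    for name, old in pretrained.items():
        new = finetuned[name]
        if len(old) != len(new):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * o + alpha * n for o, n in zip(old, new)]
    return merged


# Toy checkpoints, for illustration only.
pretrained = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
finetuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}
print(interpolate_weights(pretrained, finetuned, alpha=0.5))
```

In practice `alpha` would be chosen on held-out data from both the new domain and the abilities that must be retained, since the interpolation only exposes the trade-off and does not resolve it.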
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey on continual learning for vision-language models (VLMs) and multimodal large language models (MLLMs). It identifies unique challenges beyond standard catastrophic forgetting, including cross-modal feature drift, parameter interference from shared architectures, zero-shot capability erosion, and an 'alignment tax' that disrupts Chain-of-Thought reasoning in generative models. The authors propose a challenge-driven taxonomy with four paradigms—(1) Multi-Modal Replay Strategies, (2) Cross-Modal Regularization, (3) Parameter-Efficient Adaptation, and (4) Model Fusion and Decoupling—to organize existing methods. The survey also reviews shifts in evaluation protocols toward dual-track (Domain vs. Ability) benchmarks and micro-diagnostic CoT evaluations, and outlines a future research roadmap.
Significance. If the taxonomy accurately and comprehensively organizes the literature on continual learning for both predictive VLMs and generative MLLMs without major omissions or forced categorizations, the survey would provide a valuable diagnostic framework for the field. Its emphasis on bridging the two model classes and advocating diagnostic evaluations could help researchers target the specific vulnerabilities of cross-modal alignment under non-stationary data.
major comments (2)
- [Taxonomy section] Taxonomy section (proposal of the four core paradigms): The central claim that the taxonomy is challenge-driven and comprehensively organizes solutions is load-bearing for the paper's contribution as the 'first comprehensive' review. However, the manuscript lacks an explicit coverage audit or mapping table showing how all cited methods (including prompt-based continual adaptation and hybrid replay-regularization approaches) fit into the four paradigms without retrofitting or omission. This weakens the assertion of comprehensive organization.
- [Failure modes deconstruction] Failure modes deconstruction (cross-modal drift, alignment tax, etc.): The paper positions these as primary unique challenges for VLMs relative to unimodal CL, but provides no quantitative synthesis or comparative analysis across cited works to substantiate that these modes dominate over standard forgetting; this is needed to support the taxonomy's challenge-driven foundation.
minor comments (2)
- [Abstract] Abstract: The phrase 'Parameter-Efficient Adaptation} utilizing' contains a stray closing brace, which is a typographical error.
- [Evaluation protocols] Evaluation protocols discussion: The shift to dual-track benchmarks is highlighted, but the section would benefit from concrete citations or examples of current benchmarks that exemplify the Domain vs. Ability distinction to improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our survey manuscript. The comments highlight important areas for strengthening the presentation of our taxonomy and supporting analysis. We address each major comment point by point below, with clear indications of planned revisions.
- Referee: [Taxonomy section] Taxonomy section (proposal of the four core paradigms): The central claim that the taxonomy is challenge-driven and comprehensively organizes solutions is load-bearing for the paper's contribution as the 'first comprehensive' review. However, the manuscript lacks an explicit coverage audit or mapping table showing how all cited methods (including prompt-based continual adaptation and hybrid replay-regularization approaches) fit into the four paradigms without retrofitting or omission. This weakens the assertion of comprehensive organization.
  Authors: We agree that an explicit mapping table would make the taxonomy's coverage more transparent and verifiable. In the revised manuscript, we will add a dedicated table that maps every cited method to one of the four paradigms. This includes prompt-based continual adaptation, which we categorize under Parameter-Efficient Adaptation because it focuses on efficient parameter updates, and hybrid replay-regularization approaches, which we place according to the primary challenge they address. Each categorization will be justified by the dominant challenge the method targets, so that the taxonomy remains challenge-driven without omissions or retrofitting.
  revision: yes
- Referee: [Failure modes deconstruction] Failure modes deconstruction (cross-modal drift, alignment tax, etc.): The paper positions these as primary unique challenges for VLMs relative to unimodal CL, but provides no quantitative synthesis or comparative analysis across cited works to substantiate that these modes dominate over standard forgetting; this is needed to support the taxonomy's challenge-driven foundation.
  Authors: The deconstruction in the manuscript is grounded in a systematic qualitative review of the literature, where these VLM-specific failure modes are consistently highlighted as distinct vulnerabilities. To provide stronger substantiation, we will add a comparative synthesis subsection in the revision that aggregates and contrasts key observations and metrics from the cited works, illustrating the relative prominence of cross-modal drift, alignment tax, and related issues versus standard forgetting. We note that a formal quantitative meta-analysis is inherently limited by the heterogeneity of evaluation protocols and metrics across existing studies, but the added synthesis will better support the challenge-driven foundation of the taxonomy.
  revision: partial
Circularity Check
No circularity: survey taxonomy constructed from external literature analysis
full rationale
This is a survey paper whose central contribution is a literature review and a proposed four-paradigm taxonomy of continual learning methods for VLMs/MLLMs. The taxonomy is derived by grouping existing published approaches according to the failure modes they address (cross-modal drift, alignment tax, etc.). No equations, fitted parameters, or self-referential definitions appear; the classification rests on cited external works rather than reducing to any input by construction. Self-citations, if present, are not load-bearing for the taxonomy itself. The paper is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLMs and MLLMs face unique challenges in continual learning including cross-modal feature drift, parameter interference, zero-shot capability erosion, and an alignment tax in generative models.
invented entities (4)
- Multi-Modal Replay Strategies: no independent evidence
- Cross-Modal Regularization: no independent evidence
- Parameter-Efficient Adaptation: no independent evidence
- Model Fusion and Decoupling paradigms: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
We propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 6 Pith papers
- Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
  Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
- DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
  DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
- ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing
  ImageHD delivers up to 40.4x speedup and 383x energy efficiency for on-device continual learning of visual representations by using hyperdimensional computing and bounded exemplar management on an FPGA.
- AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
  AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...
- iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models
  iGSP uses implicit gradient subspace projection in two phases to enable efficient continual adaptation of vision-language models, claiming SOTA accuracy with 42.7% fewer trainable parameters and 86.9% less total param...
- MAny: Merge Anything for Multimodal Continual Instruction Tuning
  MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.
Reference graph
Works this paper leans on
-
[1]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763
work page 2021
-
[2]
Scaling up visual and vision-language representation learning with noisy text supervision,
C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning, 2021, pp. 4904–4916
work page 2021
-
[3]
J. Li, D. Li, C. Xiong, and S. C. Hoi, “Blip: Bootstrap- ping language-image pre-training for unified vision- language understanding and generation,” in Interna- tional Conference on Machine Learning , 2022
work page 2022
-
[4]
Flamingo: A visual language model for few-shot learning,
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, A. Hassani, A. Mensch, B. Millar, M. Reynolds, R. Ring et al., “Flamingo: A visual language model for few-shot learning,” in Advances in Neural Information Process- ing Systems, 2022
work page 2022
-
[5]
Retrieval-enhanced visual prompt learning for few-shot classification,
J. Rong, H. Chen, T. Chen, L. Ou, X. Yu, and Y . Liu, “Retrieval-enhanced visual prompt learning for few-shot classification,” arXiv preprint arXiv:2306.02243 , 2023
-
[6]
RA-CLIP: Retrieval augmented contrastive language-image pre-training,
C.-W. Xie, S. Sun, X. Xiong, Y . Zheng, D. Zhao, and J. Zhou, “RA-CLIP: Retrieval augmented contrastive language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 265–19 274
work page 2023
-
[7]
VQA: Visual question answering,
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2015, pp. 2425– 2433
work page 2015
-
[8]
CoCa: Contrastive Captioners are Image-Text Foundation Models
J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyed- hosseini, and Y . Wu, “CoCa: Contrastive caption- ers are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
ALIP: Adaptive language-image pre-training with synthetic caption,
K. Yang, J. Deng, X. An, J. Li, Z. Feng, J. Guo, J. Yang, and T. Liu, “ALIP: Adaptive language-image pre-training with synthetic caption,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2922–2931
work page 2023
-
[10]
Catastrophic interfer- ence in connectionist networks: The sequential learn- ing problem,
M. McCloskey and N. J. Cohen, “Catastrophic interfer- ence in connectionist networks: The sequential learn- ing problem,” Psychology of learning and motivation , vol. 24, pp. 109–165, 1989
work page 1989
-
[11]
Continual lifelong learning with neural networks: A review,
G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks , vol. 113, pp. 54–71, 2019
work page 2019
-
[12]
A con- tinual learning survey: Defying forgetting in classifica- tion tasks,
M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A con- tinual learning survey: Defying forgetting in classifica- tion tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
work page 2021
-
[13]
Overcoming catastrophic forgetting in neural networks,
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Ve- ness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,”Proceedings of the National Academy of Sciences , 2017
work page 2017
-
[14]
icarl: Incremental classifier and representation learning,
S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lam- pert, “icarl: Incremental classifier and representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2017
work page 2017
-
[15]
Lifelong learning with dynamically expandable networks,
J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,” in International Conference on Learning Representations , 2018
work page 2018
-
[16]
Multimodal continual learning: A survey,
A. Douillard, S. Choi, P. Goyal, E. Belilovsky, and M. Cord, “Multimodal continual learning: A survey,” arXiv preprint arXiv:2209.06720 , 2022
-
[17]
PromptMM: Prompt-based multi-modal continual learning,
J. Wang and Z. Liu, “PromptMM: Prompt-based multi-modal continual learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, 2023
work page 2023
-
[18]
Audio- visual event localization in unconstrained videos,
Y . Tian, X. Shi, B. Li, Z. Duan, and C. Xu, “Audio- visual event localization in unconstrained videos,” in European Conference on Computer Vision , 2018, pp. 247–263
work page 2018
-
[19]
Microsoft coco: Common objects in context,
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer vision– ECCV 2014: 13th European conference, zurich, Switzer- land, September 6-12, 2014, proceedings, part v 13 . Springer, 2014, pp. 740–755
work page 2014
-
[20]
Incclip: Informed incremental learning for vision-language mod- els,
X. Wang, Z. Yu, L. Yuan, and Y . Zhang, “Incclip: Informed incremental learning for vision-language mod- els,” in European Conference on Computer Vision , 2023
work page 2023
-
[21]
Zscl: Zero- shot continual learning with vision-language models,
Y . Zhang, T. Gao, Y . Wang, and Y . Zhang, “Zscl: Zero- shot continual learning with vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023
work page 2023
-
[22]
Lora: Low-rank adaptation of large language models,
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen et al. , “Lora: Low-rank adaptation of large language models,” International Conference on Learning Representations , 2022
work page 2022
-
[23]
MoE- Adapters: Parameter-efficient continual learning for vision-language models,
Y . Zhou, X. Wang, X. Liu, and Y . Zhang, “MoE- Adapters: Parameter-efficient continual learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[24]
Synthetic data is an elegant gift for continual vision-language models,
B. Wu, W. Shi, J. Wang, and M. Ye, “Synthetic data is an elegant gift for continual vision-language models,” arXiv preprint arXiv:2503.04229 , 2025
-
[25]
Triplet: Task- aware regularization for inter-modal prompt learning,
Z. Wu, C. Shen, Y . He, and Y . Wang, “Triplet: Task- aware regularization for inter-modal prompt learning,” in Proceedings of the IEEE/CVF International Confer- ence on Computer Vision , 2023
work page 2023
-
[26]
Quad: Query- augmented distillation for vision-language continual learning,
Z. Chen, X. Wang, X. Liu, and Y . Zhang, “Quad: Query- augmented distillation for vision-language continual learning,” arXiv preprint arXiv:2307.09573 , 2023
-
[27]
Recent advances of multimodal contin- ual learning: A comprehensive survey,
D. Yu, X. Zhang, Y . Chen, A. Liu, Y . Zhang, P. S. Yu, and I. King, “Recent advances of multimodal contin- ual learning: A comprehensive survey,” arXiv preprint JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 8, AUGUST 2025 14 arXiv:2410.05352, 2024
-
[28]
Z. Chen and B. Liu, Lifelong machine learning . Mor- gan & Claypool Publishers, 2018
work page 2018
-
[29]
Continual learning through synaptic intelligence,
F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in International Confer- ence on Machine Learning , 2017, pp. 3987–3995
work page 2017
-
[30]
Z. Li and D. Hoiem, “Learning without forgetting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
work page 2017
-
[31]
L3DOC: Lifelong 3d object classification,
Y . Liu, Y . Cong, G. Sun, T. Zhang, J. Dong, and H. Liu, “L3DOC: Lifelong 3d object classification,” IEEE Transactions on Image Processing , vol. 30, pp. 7486–7498, 2021
work page 2021
-
[32]
Memory aware synapses: Learning what (not) to forget,
R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in European Conference on Computer Vision, 2018
work page 2018
-
[33]
A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[34]
Recall: Replay-based continual learning in semantic segmentation,
A. Maracani, U. Michieli, M. Toldo, and P. Zanuttigh, “Recall: Replay-based continual learning in semantic segmentation,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition , 2021, pp. 7026–7035
work page 2021
-
[35]
Mem- orizing complementation network for few-shot class- incremental learning,
Z. Ji, Z. Hou, X. Liu, Y . Pang, and X. Li, “Mem- orizing complementation network for few-shot class- incremental learning,” IEEE Transactions on Image Processing, vol. 32, pp. 937–948, 2023
work page 2023
-
[36]
Rebalancing batch normalization for exemplar-based class-incremental learning,
S. Cha, S. Cho, D. Hwang, S. Hong, M. Lee, and T. Moon, “Rebalancing batch normalization for exemplar-based class-incremental learning,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 127–20 136
work page 2023
-
[37]
Augmented box replay: Overcoming foreground shift for incremental object detection,
Y . Liu, Y . Cong, D. Goswami, X. Liu, and J. van de Wei- jer, “Augmented box replay: Overcoming foreground shift for incremental object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 367–11 377
work page 2023
-
[38]
Vision-language models for vision tasks: A survey,
J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 46, no. 8, pp. 5625–5644, 2024
work page 2024
-
[39]
A survey of vision-language pre-trained models,
Y . Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” arXiv preprint arXiv:2202.10936, 2022
-
[40]
Sigmoid loss for language image pre-training,
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 11 975–11 986
work page 2023
-
[41]
Align before fuse: Vision and language representation learning with momentum distillation,
J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Advances in Neural Information Processing Systems , vol. 34, pp. 9694–9705, 2021
work page 2021
-
[42]
Vilt: Vision-and-language transformer without convolution or region supervi- sion,
W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervi- sion,” in International conference on machine learning . PMLR, 2021, pp. 5583–5594
work page 2021
-
[43]
Align before fuse: Vision and language representation learning with mo- mentum distillation,
J. Li, J. Baldridge, and S. C. Hoi, “Align before fuse: Vision and language representation learning with mo- mentum distillation,” inAdvances in Neural Information Processing Systems (NeurIPS) , 2021
work page 2021
-
[44]
Flava: A foundational language and vision alignment model,
A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “Flava: A foundational language and vision alignment model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 638–15 650
work page 2022
-
[45]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 34 892–34 916, 2023
work page 2023
-
[47]
Generative multi-modal models are good class incre- mental learners,
X. Cao, H. Lu, L. Huang, X. Liu, and M.-M. Cheng, “Generative multi-modal models are good class incre- mental learners,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition , 2024, pp. 28 706–28 717
work page 2024
-
[48]
H. Guo, F. Zeng, Z. Xiang, F. Zhu, D.-H. Wang, X.- Y . Zhang, and C.-L. Liu, “Hide-llava: Hierarchical de- coupling for continual instruction tuning of multimodal large language model,” 2025
work page 2025
-
[49]
J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: bootstrap- ping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning , 2023
work page 2023
-
[50]
Learning task-aware language-image representation for class- incremental object detection,
H. Zhang, B.-B. Gao, Y . Zeng, X. Tian, X. Tan, Z. Zhang, Y . Qu, J. Liu, and Y . Xie, “Learning task-aware language-image representation for class- incremental object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 38, no. 7, 2024, pp. 7096–7104
work page 2024
-
[51]
Grounded language-image pre-training,
L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y . Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al. , “Grounded language-image pre-training,” in Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition , 2022, pp. 10 965–10 975
work page 2022
-
[52]
Ranpac: Random pro- jections and pre-trained models for continual learning,
M. D. McDonnell, D. Gong, A. Parvaneh, E. Abbas- nejad, and A. Van den Hengel, “Ranpac: Random pro- jections and pre-trained models for continual learning,” Advances in Neural Information Processing Systems , vol. 36, pp. 12 022–12 053, 2023
work page 2023
-
[53]
Isolation and impartial aggregation: A paradigm of incremental learning without interference,
Y . Wang, Z. Ma, Z. Huang, Y . Wang, Z. Su, and X. Hong, “Isolation and impartial aggregation: A paradigm of incremental learning without interference,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp. 10 209–10 217
work page 2023
-
[54]
A unified continual learning framework with general parameter-efficient tuning,
Q. Gao, C. Zhao, Y . Sun, T. Xi, G. Zhang, B. Ghanem, and J. Zhang, “A unified continual learning framework with general parameter-efficient tuning,” inProceedings of the IEEE/CVF International Conference on Com- puter Vision, 2023, pp. 11 483–11 493
work page 2023
-
[55]
Weighted ensemble models are strong continual learn- ers,
I. E. Marouf, S. Roy, E. Tartaglione, and S. Lathuili `ere, JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 8, AUGUST 2025 15 “Weighted ensemble models are strong continual learn- ers,” in European Conference on Computer Vision . Springer, 2024, pp. 306–324
work page 2025
-
[56]
D.-W. Zhou, Z.-W. Cai, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Revisiting class-incremental learning with pre- trained models: Generalizability and adaptivity are all you need,” International Journal of Computer Vision , vol. 133, no. 3, pp. 1012–1032, 2025
work page 2025
-
[57]
Towards a unified view of parameter-efficient transfer learning,
J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neu- big, “Towards a unified view of parameter-efficient transfer learning,” arXiv preprint arXiv:2110.04366 , 2021
-
[58]
Finetuned Language Models Are Zero-Shot Learners
J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[59]
Parameter-efficient transfer learning for nlp,
N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International conference on machine learning . PMLR, 2019, pp. 2790–2799
work page 2019
-
[60]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[61]
Orthogonal subspace learning for language model continual learning,
X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang, “Orthogonal subspace learning for language model continual learning,” arXiv preprint arXiv:2310.14152, 2023
-
[62]
S. Yang, K.-P. Ning, Y.-Y. Liu, J.-Y. Yao, Y.-H. Tian, Y.-B. Song, and L. Yuan, “Is parameter collision hindering continual learning in LLMs?” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 4243–4259
-
[63]
J. Qiao, Z. Zhang, X. Tan, Y. Qu, W. Zhang, Z. Han, and Y. Xie, “Gradient projection for continual parameter-efficient tuning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
-
[64]
Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 139–149
-
[65]
Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy et al., “DualPrompt: Complementary prompting for rehearsal-free continual learning,” in European Conference on Computer Vision, 2022, pp. 631–648
-
[66]
S. Garg, M. Farajtabar, H. Pouransari, R. Vemulapalli, S. Mehta, O. Tuzel, V. Shankar, and F. Faghri, “TiC-CLIP: Continual training of CLIP models,” arXiv preprint arXiv:2310.16226, 2023
-
[67]
H. Zhao, F. Zhu, R. Wang, G. Meng, and Z. Zhang, “MLLM-CL: Continual learning for multimodal large language models,” 2025
-
[68]
K. Roth, V. Udandarao, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, M. Bethge, and Z. Akata, “A practitioner’s guide to continual multimodal pretraining,” arXiv preprint arXiv:2408.14471, 2024
-
[69]
M. Masana, X. Liu, B. Twardowski, M. Menta, A. D. Bagdanov, and J. van de Weijer, “Class-incremental learning: survey and performance evaluation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
-
[70]
K. Wang, L. Herranz, and J. van de Weijer, “Continual learning in cross-modal retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3628–3638
-
[71]
L. Huang, X. Cao, H. Lu, and X. Liu, “Class-incremental learning with CLIP: Adaptive representation adjustment and parameter fusion,” in European Conference on Computer Vision, 2024, pp. 214–231
-
[72]
L. Yu, H. Han, Z. Tao, H. Yao, and C. Xu, “Language guided concept bottleneck models for interpretable continual learning,” arXiv preprint arXiv:2503.23283, 2025
-
[73]
S. Jha, D. Gong, and L. Yao, “CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models,” arXiv preprint arXiv:2403.19137, 2024
-
[74]
L. Huang, X. Cao, H. Lu, Y. Meng, F. Yang, and X. Liu, “Mind the gap: Preserving and compensating for the modality gap in CLIP-based continual learning,” arXiv preprint arXiv:2507.09118, 2025
-
[75]
M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoong et al., “Robust fine-tuning of zero-shot models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7959–7971
-
[76]
Z. Zheng, M. Ma, K. Wang, Z. Qin, X. Yue, and Y. You, “Preventing zero-shot transfer degradation in continual learning of vision-language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 125–19 136
-
[77]
D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems, vol. 30, 2017
-
[78]
H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep generative replay,” in Advances in Neural Information Processing Systems, vol. 30, 2017
-
[79]
X. Zhang, F. Zhang, and C. Xu, “VQACL: A novel visual question answering continual learning setting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 102–19 112
-
[80]
X. Chen, J. Zhang, X. Wang, N. Zhang, T. Wu, Y. Wang, Y. Wang, and H. Chen, “Continual multimodal knowledge graph construction,” arXiv preprint arXiv:2305.08698, 2023