Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

Alexandra Gomez-Villa; Dipam Goswami; Joost van de Weijer; Linlan Huang; Qiuhe Hong; Xialei Liu; Yonghong Tian; Yuyang Liu

arxiv: 2508.04227 · v2 · pith:CZNJS46Tnew · submitted 2025-08-06 · 💻 cs.CV · cs.LG

Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian

classification 💻 cs.CV cs.LG

keywords alignment cross-modal learning vlms continual forgetting mllms zero-shot

Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
cs.LG 2026-05 unverdicted novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...

DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
cs.CV 2026-04 unverdicted novelty 7.0

DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.

ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing
cs.CV 2026-04 unverdicted novelty 6.0

ImageHD delivers up to 40.4x speedup and 383x energy efficiency for on-device continual learning of visual representations by using hyperdimensional computing and bounded exemplar management on an FPGA.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
cs.CV 2026-04 unverdicted novelty 6.0

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...

iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models
cs.CV 2026-05 unverdicted novelty 5.0

iGSP uses implicit gradient subspace projection in two phases to enable efficient continual adaptation of vision-language models, claiming SOTA accuracy with 42.7% fewer trainable parameters and 86.9% less total param...

MAny: Merge Anything for Multimodal Continual Instruction Tuning
cs.LG 2026-04 unverdicted novelty 5.0

MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.