arxiv: 2508.04227 · v2 · pith:CZNJS46Tnew · submitted 2025-08-06 · 💻 cs.CV · cs.LG

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

Pith reviewed 2026-05-21 23:37 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords continual learning, vision-language models, multimodal large language models, catastrophic forgetting, taxonomy, survey, cross-modal alignment, zero-shot generalization

The pith

This survey establishes a taxonomy of four paradigms to address unique continual learning challenges in vision-language models and multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create the first comprehensive review of continual learning methods tailored for vision-language models and generative multimodal large language models. It breaks down specific problems these models face when learning from changing data, such as drift between visual and language features and loss of reasoning chains. The authors organize solutions into four main approaches based on these problems. A reader would care because these models are widely used but currently struggle to update without losing their core abilities, limiting their use in dynamic real-world settings.

Core claim

Continual learning for VLMs requires going beyond standard forgetting mitigation because of unique issues including cross-modal feature drift, parameter interference from shared architectures, erosion of zero-shot capabilities, and, in generative MLLMs, an alignment tax that disrupts chain-of-thought reasoning. The survey deconstructs these failure modes and introduces a challenge-driven taxonomy with four paradigms: multi-modal replay strategies for memory drift, cross-modal regularization for alignment, parameter-efficient adaptation with dynamic routing, and model fusion and decoupling. It further calls for better benchmarks that track both domain changes and ability retention, along with detailed reasoning evaluation.

What carries the argument

The challenge-driven taxonomy that organizes continual learning techniques around the distinct failure modes of cross-modal models.

If this is right

  • Adopting the taxonomy will direct efforts toward methods that maintain cross-modal alignments during continual updates.
  • Evaluation protocols will evolve to include separate tracking of domain adaptation and ability preservation in benchmarks.
  • Research will advance toward compositional zero-shot learning and integration with embodied systems using sensor data.
  • Autonomous agentic ecosystems will benefit from models that can update without collapsing their reasoning structures.
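To make the idea of tracking adaptation and retention separately more concrete, here is a minimal sketch using common continual-learning metrics (final average accuracy, average forgetting, and a held-out retention ratio). The function names, the accuracy-matrix layout, and all numbers are illustrative assumptions and are not taken from the survey or from any specific benchmark.

```python
from typing import List


def final_average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after the last training stage.

    acc[i][j] is the accuracy on task j measured after training stage i.
    """
    last = acc[-1]
    return sum(last) / len(last)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    n = len(acc)
    if n < 2:
        return 0.0
    drops = []
    for j in range(n - 1):  # the last task has had no chance to be forgotten
        best_before = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_before - acc[-1][j])
    return sum(drops) / len(drops)


def zero_shot_retention(pretrained: List[float], final: List[float]) -> float:
    """Ratio of mean held-out accuracy after training to the mean before training."""
    return (sum(final) / len(final)) / (sum(pretrained) / len(pretrained))


# Hypothetical accuracies for three sequential tasks (rows: stage, columns: task).
acc = [
    [0.80, 0.10, 0.10],
    [0.70, 0.85, 0.10],
    [0.60, 0.75, 0.90],
]

if __name__ == "__main__":
    print("final average accuracy:", final_average_accuracy(acc))
    print("average forgetting:", average_forgetting(acc))
    print("zero-shot retention:", zero_shot_retention([0.60, 0.70], [0.54, 0.63]))
```

Reporting the retention ratio next to the forgetting score keeps the two tracks apart: a model can show low forgetting on the tasks it was trained on while still losing accuracy on held-out tasks it never revisits.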

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework might extend naturally to other multimodal combinations such as audio-language or video-text models.
  • Testing the taxonomy's coverage could involve applying it to classify methods in related fields like continual learning for large language models alone.
  • Future work could explore whether model fusion approaches offer advantages in resource-constrained environments for deploying updated VLMs.

Load-bearing premise

The identified failure modes of cross-modal feature drift, parameter interference, zero-shot erosion, and alignment tax represent the primary distinctive challenges for VLMs and MLLMs in continual learning, with the four paradigms sufficiently covering existing solutions.

What would settle it

Publication of multiple new continual learning methods for VLMs that do not align with any of the four proposed paradigms or that solve the problems without using cross-modal specific techniques would indicate the taxonomy is incomplete.
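One way such a check could be run in practice is a small coverage audit over a method-to-paradigm mapping. The sketch below is only an illustration: the paradigm names come from the abstract, while `audit_coverage`, the method labels (`method_a`, and so on), and the label "Some Other Approach" are hypothetical placeholders rather than entries from the survey.

```python
from typing import Dict, List, Set, Tuple

# Paradigm names as listed in the abstract.
PARADIGMS = (
    "Multi-Modal Replay Strategies",
    "Cross-Modal Regularization",
    "Parameter-Efficient Adaptation",
    "Model Fusion and Decoupling",
)


def audit_coverage(
    assignments: Dict[str, Set[str]],
) -> Tuple[Dict[str, List[str]], List[str], List[str]]:
    """Group methods by paradigm and flag problem cases.

    Returns (methods per paradigm, methods matching no paradigm,
    methods matching more than one paradigm).
    """
    by_paradigm: Dict[str, List[str]] = {p: [] for p in PARADIGMS}
    uncovered: List[str] = []
    hybrid: List[str] = []
    for method, labels in sorted(assignments.items()):
        valid = [p for p in PARADIGMS if p in labels]
        if not valid:
            uncovered.append(method)
        if len(valid) > 1:
            hybrid.append(method)
        for p in valid:
            by_paradigm[p].append(method)
    return by_paradigm, uncovered, hybrid


# Hypothetical placeholder methods; not real entries from the survey.
example = {
    "method_a": {"Multi-Modal Replay Strategies"},
    "method_b": {"Multi-Modal Replay Strategies", "Cross-Modal Regularization"},
    "method_c": {"Some Other Approach"},
}

if __name__ == "__main__":
    groups, uncovered, hybrid = audit_coverage(example)
    print("uncovered:", uncovered)
    print("hybrid:", hybrid)
```

A growing `uncovered` list would be the signal described above that the taxonomy is incomplete, and a long `hybrid` list would suggest the paradigm boundaries are less clean than presented.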

Figures

Figures reproduced from arXiv: 2508.04227 by Alexandra Gomez-Villa, Dipam Goswami, Joost Van De Weijer, Linlan Huang, Qiuhe Hong, Xialei Liu, Yonghong Tian, Yuyang Liu.

Figure 1: Our Taxonomy of Continual Learning Strategies for …
Figure 2: Illustration of three core challenges in VLM-CL.
Figure 3: Summary of VLM continual learning methods published in recent years. Several methods combine approaches; we …
Figure 4: Illustration of the detailed settings of different VLM continual learning methods. Methods are grouped into three …
Figure 5: Schematic illustration of the metrics commonly employed in VLM-CL. The matrix shows performance from its pretrained …
Figure 6: Comparison of SOTA methods on VQACL benchmark, …
Original abstract

Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLMs) have revolutionized artificial intelligence with unprecedented cross-modal alignment and zero-shot generalization. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. Furthermore, generative MLLMs exhibit a unique "alignment tax," where catastrophic forgetting manifests not merely as factual amnesia, but as a systemic collapse of deep Chain-of-Thought (CoT) reasoning. This survey presents the first comprehensive, diagnostic review bridging continual learning for both predictive VLMs and generative MLLMs. We systematically deconstruct the aforementioned failure modes and propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation} utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms. We critically analyze the evolution of evaluation protocols, highlighting the essential shift toward dual-track benchmarks (Domain vs. Ability CL) and micro-diagnostic CoT evaluations. Finally, we chart a roadmap for future research, emphasizing compositional zero-shot learning, embodied AI with sensor fusion, and autonomous agentic ecosystems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
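As a rough illustration of what "cross-modal feature drift" can mean operationally, the sketch below compares image-text cosine similarities before and after an update and reports their mean squared change. This is an assumed, simplified proxy written for this page; it is not the survey's formulation, and the function names and toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Pairwise image-text cosine similarities (rows: images, columns: texts)."""
    return [[cosine(i, t) for t in text_embs] for i in image_embs]


def alignment_drift(old_sim: List[List[float]], new_sim: List[List[float]]) -> float:
    """Mean squared change in image-text similarity between two checkpoints."""
    diffs = [
        (o - n) ** 2
        for row_old, row_new in zip(old_sim, new_sim)
        for o, n in zip(row_old, row_new)
    ]
    return sum(diffs) / len(diffs)


# Toy 2-D embeddings (hypothetical): the text side stays fixed while the
# image side swaps its two directions after an update.
text = [[1.0, 0.0], [0.0, 1.0]]
images_before = [[1.0, 0.0], [0.0, 1.0]]
images_after = [[0.0, 1.0], [1.0, 0.0]]

if __name__ == "__main__":
    before = similarity_matrix(images_before, text)
    after = similarity_matrix(images_after, text)
    print("drift:", alignment_drift(before, after))
```

A regularizer in this spirit would add a term like `alignment_drift` to the training loss so that updates which distort the old similarity structure are penalized; the methods the survey groups under Cross-Modal Regularization may use quite different formulations.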

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a survey on continual learning for vision-language models (VLMs) and multimodal large language models (MLLMs). It identifies unique challenges beyond standard catastrophic forgetting, including cross-modal feature drift, parameter interference from shared architectures, zero-shot capability erosion, and an 'alignment tax' that disrupts Chain-of-Thought reasoning in generative models. The authors propose a challenge-driven taxonomy with four paradigms—(1) Multi-Modal Replay Strategies, (2) Cross-Modal Regularization, (3) Parameter-Efficient Adaptation, and (4) Model Fusion and Decoupling—to organize existing methods. The survey also reviews shifts in evaluation protocols toward dual-track (Domain vs. Ability) benchmarks and micro-diagnostic CoT evaluations, and outlines a future research roadmap.

Significance. If the taxonomy accurately and comprehensively organizes the literature on continual learning for both predictive VLMs and generative MLLMs without major omissions or forced categorizations, the survey would provide a valuable diagnostic framework for the field. Its emphasis on bridging the two model classes and advocating diagnostic evaluations could help researchers target the specific vulnerabilities of cross-modal alignment under non-stationary data.

major comments (2)
  1. [Taxonomy section] Taxonomy section (proposal of the four core paradigms): The central claim that the taxonomy is challenge-driven and comprehensively organizes solutions is load-bearing for the paper's contribution as the 'first comprehensive' review. However, the manuscript lacks an explicit coverage audit or mapping table showing how all cited methods (including prompt-based continual adaptation and hybrid replay-regularization approaches) fit into the four paradigms without retrofitting or omission. This weakens the assertion of comprehensive organization.
  2. [Failure modes deconstruction] Failure modes deconstruction (cross-modal drift, alignment tax, etc.): The paper positions these as primary unique challenges for VLMs relative to unimodal CL, but provides no quantitative synthesis or comparative analysis across cited works to substantiate that these modes dominate over standard forgetting; this is needed to support the taxonomy's challenge-driven foundation.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'Parameter-Efficient Adaptation} utilizing' contains a stray closing brace, which is a typographical error.
  2. [Evaluation protocols] Evaluation protocols discussion: The shift to dual-track benchmarks is highlighted, but the section would benefit from concrete citations or examples of current benchmarks that exemplify the Domain vs. Ability distinction to improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our survey manuscript. The comments highlight important areas for strengthening the presentation of our taxonomy and supporting analysis. We address each major comment point by point below, with clear indications of planned revisions.

Point-by-point responses
  1. Referee: [Taxonomy section] Taxonomy section (proposal of the four core paradigms): The central claim that the taxonomy is challenge-driven and comprehensively organizes solutions is load-bearing for the paper's contribution as the 'first comprehensive' review. However, the manuscript lacks an explicit coverage audit or mapping table showing how all cited methods (including prompt-based continual adaptation and hybrid replay-regularization approaches) fit into the four paradigms without retrofitting or omission. This weakens the assertion of comprehensive organization.

    Authors: We agree that an explicit mapping table would enhance the transparency and verifiability of the taxonomy's coverage. In the revised manuscript, we will add a dedicated table that systematically maps all cited methods to the four paradigms, including prompt-based continual adaptation (categorized under Parameter-Efficient Adaptation due to its focus on efficient parameter updates) and hybrid replay-regularization approaches (placed according to the primary challenge they address). The categorization will be justified by the dominant challenge each method targets, ensuring the taxonomy remains challenge-driven without omissions or retrofitting. revision: yes

  2. Referee: [Failure modes deconstruction] Failure modes deconstruction (cross-modal drift, alignment tax, etc.): The paper positions these as primary unique challenges for VLMs relative to unimodal CL, but provides no quantitative synthesis or comparative analysis across cited works to substantiate that these modes dominate over standard forgetting; this is needed to support the taxonomy's challenge-driven foundation.

    Authors: The deconstruction in the manuscript is grounded in a systematic qualitative review of the literature, where these VLM-specific failure modes are consistently highlighted as distinct vulnerabilities. To provide stronger substantiation, we will add a comparative synthesis subsection in the revision that aggregates and contrasts key observations and metrics from the cited works, illustrating the relative prominence of cross-modal drift, alignment tax, and related issues versus standard forgetting. We note that a formal quantitative meta-analysis is inherently limited by the heterogeneity of evaluation protocols and metrics across existing studies, but the added synthesis will better support the challenge-driven foundation of the taxonomy. revision: partial

Circularity Check

0 steps flagged

No circularity: survey taxonomy constructed from external literature analysis

full rationale

This is a survey paper whose central contribution is a literature review and a proposed four-paradigm taxonomy of continual learning methods for VLMs/MLLMs. The taxonomy is derived by grouping existing published approaches according to the failure modes they address (cross-modal drift, alignment tax, etc.). No equations, fitted parameters, or self-referential definitions appear; the classification rests on cited external works rather than reducing to any input by construction. Self-citations, if present, are not load-bearing for the taxonomy itself. The paper is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The paper's main contribution is the new taxonomy; it rests on domain assumptions about VLM failure modes rather than new free parameters or invented physical entities.

axioms (1)
  • [domain assumption] VLMs and MLLMs face unique challenges in continual learning including cross-modal feature drift, parameter interference, zero-shot capability erosion, and an alignment tax in generative models.
    This premise underpins the entire diagnostic review and taxonomy construction, as stated in the abstract.
invented entities (4)
  • Multi-Modal Replay Strategies (no independent evidence)
    purpose: Addressing explicit and implicit memory drift
    First core paradigm in the proposed taxonomy.
  • Cross-Modal Regularization (no independent evidence)
    purpose: Enforcing topological and geometric alignment
    Second core paradigm in the proposed taxonomy.
  • Parameter-Efficient Adaptation (no independent evidence)
    purpose: Utilizing dynamic routing and subspace projections
    Third core paradigm in the proposed taxonomy.
  • Model Fusion and Decoupling paradigms (no independent evidence)
    purpose: Emerging approach for continual learning in VLMs
    Fourth core paradigm in the proposed taxonomy.

pith-pipeline@v0.9.0 · 5861 in / 1725 out tokens · 67827 ms · 2026-05-21T23:37:00.197003+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem.

    We propose a challenge-driven taxonomy comprising four core paradigms: (1) Multi-Modal Replay Strategies addressing explicit and implicit memory drift; (2) Cross-Modal Regularization enforcing topological and geometric alignment; (3) Parameter-Efficient Adaptation utilizing dynamic routing and subspace projections; and the emerging (4) Model Fusion and Decoupling paradigms.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

    cs.LG 2026-05 unverdicted novelty 7.0

    Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...

  2. DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.

  3. ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing

    cs.CV 2026-04 unverdicted novelty 6.0

    ImageHD delivers up to 40.4x speedup and 383x energy efficiency for on-device continual learning of visual representations by using hyperdimensional computing and bounded exemplar management on an FPGA.

  4. AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...

  5. iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    iGSP uses implicit gradient subspace projection in two phases to enable efficient continual adaptation of vision-language models, claiming SOTA accuracy with 42.7% fewer trainable parameters and 86.9% less total param...

  6. MAny: Merge Anything for Multimodal Continual Instruction Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.

Reference graph

Works this paper leans on

144 extracted references · 144 canonical work pages · cited by 6 Pith papers · 8 internal anchors

  1. [1]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763

  2. [2]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning, 2021, pp. 4904–4916

  3. [3]

    Blip: Bootstrap- ping language-image pre-training for unified vision- language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. C. Hoi, “Blip: Bootstrap- ping language-image pre-training for unified vision- language understanding and generation,” in Interna- tional Conference on Machine Learning , 2022

  4. [4]

    Flamingo: A visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, A. Hassani, A. Mensch, B. Millar, M. Reynolds, R. Ring et al., “Flamingo: A visual language model for few-shot learning,” in Advances in Neural Information Process- ing Systems, 2022

  5. [5]

    Retrieval-enhanced visual prompt learning for few-shot classification,

    J. Rong, H. Chen, T. Chen, L. Ou, X. Yu, and Y . Liu, “Retrieval-enhanced visual prompt learning for few-shot classification,” arXiv preprint arXiv:2306.02243 , 2023

  6. [6]

    RA-CLIP: Retrieval augmented contrastive language-image pre-training,

    C.-W. Xie, S. Sun, X. Xiong, Y . Zheng, D. Zhao, and J. Zhou, “RA-CLIP: Retrieval augmented contrastive language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 265–19 274

  7. [7]

    VQA: Visual question answering,

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2015, pp. 2425– 2433

  8. [8]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyed- hosseini, and Y . Wu, “CoCa: Contrastive caption- ers are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022

  9. [9]

    ALIP: Adaptive language-image pre-training with synthetic caption,

    K. Yang, J. Deng, X. An, J. Li, Z. Feng, J. Guo, J. Yang, and T. Liu, “ALIP: Adaptive language-image pre-training with synthetic caption,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2922–2931

  10. [10]

    Catastrophic interfer- ence in connectionist networks: The sequential learn- ing problem,

    M. McCloskey and N. J. Cohen, “Catastrophic interfer- ence in connectionist networks: The sequential learn- ing problem,” Psychology of learning and motivation , vol. 24, pp. 109–165, 1989

  11. [11]

    Continual lifelong learning with neural networks: A review,

    G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks , vol. 113, pp. 54–71, 2019

  12. [12]

    A con- tinual learning survey: Defying forgetting in classifica- tion tasks,

    M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A con- tinual learning survey: Defying forgetting in classifica- tion tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

  13. [13]

    Overcoming catastrophic forgetting in neural networks,

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Ve- ness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,”Proceedings of the National Academy of Sciences , 2017

  14. [14]

    icarl: Incremental classifier and representation learning,

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lam- pert, “icarl: Incremental classifier and representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2017

  15. [15]

    Lifelong learning with dynamically expandable networks,

    J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,” in International Conference on Learning Representations , 2018

  16. [16]

    Multimodal continual learning: A survey,

    A. Douillard, S. Choi, P. Goyal, E. Belilovsky, and M. Cord, “Multimodal continual learning: A survey,” arXiv preprint arXiv:2209.06720 , 2022

  17. [17]

    PromptMM: Prompt-based multi-modal continual learning,

    J. Wang and Z. Liu, “PromptMM: Prompt-based multi-modal continual learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, 2023

  18. [18]

    Audio- visual event localization in unconstrained videos,

    Y . Tian, X. Shi, B. Li, Z. Duan, and C. Xu, “Audio- visual event localization in unconstrained videos,” in European Conference on Computer Vision , 2018, pp. 247–263

  19. [19]

    Microsoft coco: Common objects in context,

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer vision– ECCV 2014: 13th European conference, zurich, Switzer- land, September 6-12, 2014, proceedings, part v 13 . Springer, 2014, pp. 740–755

  20. [20]

    Incclip: Informed incremental learning for vision-language mod- els,

    X. Wang, Z. Yu, L. Yuan, and Y . Zhang, “Incclip: Informed incremental learning for vision-language mod- els,” in European Conference on Computer Vision , 2023

  21. [21]

    Zscl: Zero- shot continual learning with vision-language models,

    Y . Zhang, T. Gao, Y . Wang, and Y . Zhang, “Zscl: Zero- shot continual learning with vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023

  22. [22]

    Lora: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen et al. , “Lora: Low-rank adaptation of large language models,” International Conference on Learning Representations , 2022

  23. [23]

    MoE- Adapters: Parameter-efficient continual learning for vision-language models,

    Y . Zhou, X. Wang, X. Liu, and Y . Zhang, “MoE- Adapters: Parameter-efficient continual learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  24. [24]

    Synthetic data is an elegant gift for continual vision-language models,

    B. Wu, W. Shi, J. Wang, and M. Ye, “Synthetic data is an elegant gift for continual vision-language models,” arXiv preprint arXiv:2503.04229 , 2025

  25. [25]

    Triplet: Task- aware regularization for inter-modal prompt learning,

    Z. Wu, C. Shen, Y . He, and Y . Wang, “Triplet: Task- aware regularization for inter-modal prompt learning,” in Proceedings of the IEEE/CVF International Confer- ence on Computer Vision , 2023

  26. [26]

    Quad: Query- augmented distillation for vision-language continual learning,

    Z. Chen, X. Wang, X. Liu, and Y . Zhang, “Quad: Query- augmented distillation for vision-language continual learning,” arXiv preprint arXiv:2307.09573 , 2023

  27. [27]

    Recent advances of multimodal contin- ual learning: A comprehensive survey,

    D. Yu, X. Zhang, Y . Chen, A. Liu, Y . Zhang, P. S. Yu, and I. King, “Recent advances of multimodal contin- ual learning: A comprehensive survey,” arXiv preprint JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 8, AUGUST 2025 14 arXiv:2410.05352, 2024

  28. [28]

    Chen and B

    Z. Chen and B. Liu, Lifelong machine learning . Mor- gan & Claypool Publishers, 2018

  29. [29]

    Continual learning through synaptic intelligence,

    F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in International Confer- ence on Machine Learning , 2017, pp. 3987–3995

  30. [30]

    Learning without forgetting,

    Z. Li and D. Hoiem, “Learning without forgetting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017

  31. [31]

    L3DOC: Lifelong 3d object classification,

    Y . Liu, Y . Cong, G. Sun, T. Zhang, J. Dong, and H. Liu, “L3DOC: Lifelong 3d object classification,” IEEE Transactions on Image Processing , vol. 30, pp. 7486–7498, 2021

  32. [32]

    Memory aware synapses: Learning what (not) to forget,

    R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in European Conference on Computer Vision, 2018

  33. [33]

    Progressive Neural Networks

    A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016

  34. [34]

    Recall: Replay-based continual learning in semantic segmentation,

    A. Maracani, U. Michieli, M. Toldo, and P. Zanuttigh, “Recall: Replay-based continual learning in semantic segmentation,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition , 2021, pp. 7026–7035

  35. [35]

    Mem- orizing complementation network for few-shot class- incremental learning,

    Z. Ji, Z. Hou, X. Liu, Y . Pang, and X. Li, “Mem- orizing complementation network for few-shot class- incremental learning,” IEEE Transactions on Image Processing, vol. 32, pp. 937–948, 2023

  36. [36]

    Rebalancing batch normalization for exemplar-based class-incremental learning,

    S. Cha, S. Cho, D. Hwang, S. Hong, M. Lee, and T. Moon, “Rebalancing batch normalization for exemplar-based class-incremental learning,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 127–20 136

  37. [37]

    Augmented box replay: Overcoming foreground shift for incremental object detection,

    Y . Liu, Y . Cong, D. Goswami, X. Liu, and J. van de Wei- jer, “Augmented box replay: Overcoming foreground shift for incremental object detection,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 367–11 377

  38. [38]

    Vision-language models for vision tasks: A survey,

    J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 46, no. 8, pp. 5625–5644, 2024

  39. [39]

    A survey of vision-language pre-trained models,

    Y . Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” arXiv preprint arXiv:2202.10936, 2022

  40. [40]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 11 975–11 986

  41. [41]

    Align before fuse: Vision and language representation learning with momentum distillation,

    J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Advances in Neural Information Processing Systems , vol. 34, pp. 9694–9705, 2021

  42. [42]

    Vilt: Vision-and-language transformer without convolution or region supervi- sion,

    W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervi- sion,” in International conference on machine learning . PMLR, 2021, pp. 5583–5594

  43. [43]

    Align before fuse: Vision and language representation learning with mo- mentum distillation,

    J. Li, J. Baldridge, and S. C. Hoi, “Align before fuse: Vision and language representation learning with mo- mentum distillation,” inAdvances in Neural Information Processing Systems (NeurIPS) , 2021

  44. [44]

    Flava: A foundational language and vision alignment model,

    A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “Flava: A foundational language and vision alignment model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 638–15 650

  45. [45]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023

  46. [46]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 34 892–34 916, 2023

  47. [47]

    Generative multi-modal models are good class incre- mental learners,

    X. Cao, H. Lu, L. Huang, X. Liu, and M.-M. Cheng, “Generative multi-modal models are good class incre- mental learners,” in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition , 2024, pp. 28 706–28 717

  48. [48]

    Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model,

    H. Guo, F. Zeng, Z. Xiang, F. Zhu, D.-H. Wang, X.-Y. Zhang, and C.-L. Liu, “Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model,” 2025

  49. [49]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning, 2023

  50. [50]

    Learning task-aware language-image representation for class-incremental object detection,

    H. Zhang, B.-B. Gao, Y. Zeng, X. Tian, X. Tan, Z. Zhang, Y. Qu, J. Liu, and Y. Xie, “Learning task-aware language-image representation for class-incremental object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7096–7104

  51. [51]

    Grounded language-image pre-training,

    L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10965–10975

  52. [52]

    Ranpac: Random projections and pre-trained models for continual learning,

    M. D. McDonnell, D. Gong, A. Parvaneh, E. Abbasnejad, and A. Van den Hengel, “Ranpac: Random projections and pre-trained models for continual learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 12022–12053, 2023

  53. [53]

    Isolation and impartial aggregation: A paradigm of incremental learning without interference,

    Y. Wang, Z. Ma, Z. Huang, Y. Wang, Z. Su, and X. Hong, “Isolation and impartial aggregation: A paradigm of incremental learning without interference,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp. 10209–10217

  54. [54]

    A unified continual learning framework with general parameter-efficient tuning,

    Q. Gao, C. Zhao, Y. Sun, T. Xi, G. Zhang, B. Ghanem, and J. Zhang, “A unified continual learning framework with general parameter-efficient tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11483–11493

  55. [55]

    Weighted ensemble models are strong continual learners,

    I. E. Marouf, S. Roy, E. Tartaglione, and S. Lathuilière, “Weighted ensemble models are strong continual learners,” in European Conference on Computer Vision. Springer, 2024, pp. 306–324

  56. [56]

    Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need,

    D.-W. Zhou, Z.-W. Cai, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need,” International Journal of Computer Vision, vol. 133, no. 3, pp. 1012–1032, 2025

  57. [57]

    Towards a unified view of parameter-efficient transfer learning,

    J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” arXiv preprint arXiv:2110.04366, 2021

  58. [58]

    Finetuned Language Models Are Zero-Shot Learners

    J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021

  59. [59]

    Parameter-efficient transfer learning for nlp,

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International conference on machine learning. PMLR, 2019, pp. 2790–2799

  60. [60]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021

  61. [61]

    Orthogonal subspace learning for language model continual learning,

    X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang, “Orthogonal subspace learning for language model continual learning,” arXiv preprint arXiv:2310.14152, 2023

  62. [62]

    Is parameter collision hindering continual learning in llms?

    S. Yang, K.-P. Ning, Y.-Y. Liu, J.-Y. Yao, Y.-H. Tian, Y.-B. Song, and L. Yuan, “Is parameter collision hindering continual learning in llms?” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 4243–4259

  63. [63]

    Gradient projection for continual parameter-efficient tuning,

    J. Qiao, Z. Zhang, X. Tan, Y. Qu, W. Zhang, Z. Han, and Y. Xie, “Gradient projection for continual parameter-efficient tuning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  64. [64]

    Learning to prompt for continual learning,

    Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 139–149

  65. [65]

    Dualprompt: Complementary prompting for rehearsal-free continual learning,

    Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy et al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” in European Conference on Computer Vision, 2022, pp. 631–648

  66. [66]

    Tic-clip: Continual training of clip models,

    S. Garg, M. Farajtabar, H. Pouransari, R. Vemulapalli, S. Mehta, O. Tuzel, V. Shankar, and F. Faghri, “Tic-clip: Continual training of clip models,” arXiv preprint arXiv:2310.16226, 2023

  67. [67]

    MLLM-CL: Continual learning for multimodal large language models,

    H. Zhao, F. Zhu, R. Wang, G. Meng, and Z. Zhang, “MLLM-CL: Continual learning for multimodal large language models,” 2025

  68. [68]

    A practitioner’s guide to continual multimodal pretraining,

    K. Roth, V. Udandarao, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, M. Bethge, and Z. Akata, “A practitioner’s guide to continual multimodal pretraining,” arXiv preprint arXiv:2408.14471, 2024

  69. [69]

    Class-incremental learning: survey and performance evaluation,

    M. Masana, X. Liu, B. Twardowski, M. Menta, A. D. Bagdanov, and J. van de Weijer, “Class-incremental learning: survey and performance evaluation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  70. [70]

    Continual learning in cross-modal retrieval,

    K. Wang, L. Herranz, and J. van de Weijer, “Continual learning in cross-modal retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3628–3638

  71. [71]

    Class-incremental learning with clip: Adaptive representation adjustment and parameter fusion,

    L. Huang, X. Cao, H. Lu, and X. Liu, “Class-incremental learning with clip: Adaptive representation adjustment and parameter fusion,” in European Conference on Computer Vision, 2024, pp. 214–231

  72. [72]

    Language guided concept bottleneck models for interpretable continual learning,

    L. Yu, H. Han, Z. Tao, H. Yao, and C. Xu, “Language guided concept bottleneck models for interpretable continual learning,” arXiv preprint arXiv:2503.23283, 2025

  73. [73]

    Clap4clip: Continual learning with probabilistic finetuning for vision-language models,

    S. Jha, D. Gong, and L. Yao, “Clap4clip: Continual learning with probabilistic finetuning for vision-language models,” arXiv preprint arXiv:2403.19137, 2024

  74. [74]

    Mind the gap: Preserving and compensating for the modality gap in clip-based continual learning,

    L. Huang, X. Cao, H. Lu, Y. Meng, F. Yang, and X. Liu, “Mind the gap: Preserving and compensating for the modality gap in clip-based continual learning,” arXiv preprint arXiv:2507.09118, 2025

  75. [75]

    Robust fine-tuning of zero-shot models,

    M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoong et al., “Robust fine-tuning of zero-shot models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7959–7971

  76. [76]

    Preventing zero-shot transfer degradation in continual learning of vision-language models,

    Z. Zheng, M. Ma, K. Wang, Z. Qin, X. Yue, and Y. You, “Preventing zero-shot transfer degradation in continual learning of vision-language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19125–19136

  77. [77]

    Gradient episodic memory for continual learning,

    D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Advances in Neural Information Processing Systems, vol. 30, 2017

  78. [78]

    Continual learning with deep generative replay,

    H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep generative replay,” in Advances in Neural Information Processing Systems, vol. 30, 2017

  79. [79]

    Vqacl: A novel visual question answering continual learning setting,

    X. Zhang, F. Zhang, and C. Xu, “Vqacl: A novel visual question answering continual learning setting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19102–19112

  80. [80]

    Continual multimodal knowledge graph construction,

    X. Chen, J. Zhang, X. Wang, N. Zhang, T. Wu, Y. Wang, Y. Wang, and H. Chen, “Continual multimodal knowledge graph construction,” arXiv preprint arXiv:2305.08698, 2023

Showing first 80 references.