
arxiv: 2605.19301 · v1 · pith:BUI7VZF3new · submitted 2026-05-19 · 💻 cs.CV

iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

Pith reviewed 2026-05-20 07:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords continual learning · vision-language models · mixture of experts · subspace projection · parameter-efficient fine-tuning · gradient projection · catastrophic forgetting · knowledge reuse

The pith

Vision-language models adapt continually by projecting new task gradients onto subspaces identified from early MoE router convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the real barrier to efficient continual learning in vision-language models is deciding what knowledge to share, and that superficial visual similarity often mismatches the actual overlap in optimization paths. It treats this overlap as a geometric property of low-rank subspaces and uses the early stabilization of Mixture-of-Experts routers to fix a shared basis for those paths. New tasks then have their gradients implicitly projected onto the existing basis through constrained regularization, with routing probabilities serving as signals to drop redundant dimensions, before an orthogonal fine-tuning stage fits the remaining task-specific loss without further interference. If this geometric reuse holds, models can maintain high accuracy across a sequence of tasks while avoiding both parameter explosion from isolated modules and negative transfer from mismatched sharing.

Core claim

iGSP splits adaptation into a Subspace Identification phase that expands candidate experts, applies subspace-constrained regularization to project incoming gradients onto the historical basis established by early MoE router convergence, and prunes redundant dimensions using routing probabilities as gradient-flow indicators, followed by an Orthogonal Subspace Fine-Tuning phase that fixes the basis and drops the regularization to fit task residuals. This process is claimed to deliver state-of-the-art accuracy on the MTIL benchmark while cutting average trainable parameters by 42.7 percent and final total parameters by 86.9 percent relative to prior methods.
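
The geometry behind the two phases can be illustrated with an explicit projection. This is a minimal sketch, not the paper's implementation: iGSP is described as projecting gradients implicitly through a regularizer whose exact form is not given on this page. The sketch assumes a hypothetical orthonormal basis `B` for the historical subspace and splits a gradient into the component inside that subspace (the reusable part) and the orthogonal residual (the task-specific part that the second phase would fit).

```python
import numpy as np

def split_gradient(grad, basis):
    """Split grad into its component inside span(basis) and the orthogonal residual.

    basis: (d, k) matrix with orthonormal columns (hypothetical historical basis).
    """
    in_subspace = basis @ (basis.T @ grad)   # projection P g with P = B B^T
    residual = grad - in_subspace            # (I - P) g, orthogonal to span(B)
    return in_subspace, residual

rng = np.random.default_rng(0)
# Orthonormalize a random (8, 3) matrix to obtain an illustrative basis.
B, _ = np.linalg.qr(rng.normal(size=(8, 3)))
g = rng.normal(size=8)
g_par, g_perp = split_gradient(g, B)
```

The two parts sum back to `g`, and `g_perp` has no component along any column of `B`, which is the sense in which residual fitting in the orthogonal complement would not disturb the shared basis.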

What carries the argument

Implicit gradient subspace projection that treats early-converged MoE routing probabilities as indicators for projecting new gradients onto and pruning within a shared historical low-rank subspace.

If this is right

  • State-of-the-art accuracy is reached on the MTIL continual learning benchmark for vision-language models.
  • Average trainable parameters drop by 42.7 percent compared with current state-of-the-art continual learning methods.
  • Final total parameter count falls by 86.9 percent relative to counterpart approaches that assign isolated modules per task.
  • Negative transfer between visually similar yet logically distinct tasks is reduced by aligning on optimization trajectory overlap instead of surface similarity.
  • Training becomes more efficient because the structural basis is identified once and then held fixed during rapid residual fitting.
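
As a quick arithmetic reading of the two reported percentages, the reductions can be restated as ratios. The symbols $P_{\text{base}}$ (parameter count of a comparison method) and $P_{\text{iGSP}}$ (parameter count of iGSP) are introduced here for illustration only.

```latex
\begin{align*}
\text{average trainable: } & P_{\text{iGSP}} = (1 - 0.427)\,P_{\text{base}} = 0.573\,P_{\text{base}}
  && \Rightarrow\ P_{\text{base}} / P_{\text{iGSP}} \approx 1.75, \\
\text{final total: } & P_{\text{iGSP}} = (1 - 0.869)\,P_{\text{base}} = 0.131\,P_{\text{base}}
  && \Rightarrow\ P_{\text{base}} / P_{\text{iGSP}} \approx 7.63.
\end{align*}
```

The abstract measures the two reductions against different baselines (current SOTA methods for trainable parameters, "counterparts" for total parameters), so the two ratios should not be multiplied together.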

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-router convergence signal could serve as a general cue for subspace construction in other routed or gated architectures beyond vision-language models.
  • Monitoring gradient flow indicators in deployed systems might allow subspaces to expand or contract dynamically after the initial identification phase.
  • Direct comparison on task sequences where visual appearance and logical structure are deliberately decorrelated would isolate the benefit of trajectory-based sharing over similarity-based baselines.

Load-bearing premise

Early convergence of MoE routers produces a stable subspace basis that captures all necessary historical information so that later gradient projection and pruning preserve task-specific details without loss.

What would settle it

Running the method on MTIL but fixing the subspace basis before the routers have converged, or replacing routing probabilities with uniform values during pruning, and observing whether accuracy falls below prior methods or the reported parameter reductions vanish.
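
The uniform-probability ablation can be made concrete with a toy pruning rule. This is a sketch under stated assumptions, not the paper's procedure: the function `prune_experts`, the mean-usage criterion, and the `threshold` value are all hypothetical. It shows only why uniform routing removes the pruning signal.

```python
import numpy as np

def prune_experts(routing_probs, threshold=0.05):
    """Return indices of experts whose mean routing probability reaches threshold.

    routing_probs: (n_samples, n_experts), each row sums to 1.
    threshold is a hypothetical hyperparameter, not taken from the paper.
    """
    mean_usage = routing_probs.mean(axis=0)
    return np.flatnonzero(mean_usage >= threshold)

# Peaked routing: experts 0 and 1 receive almost all of the traffic.
peaked = np.array([[0.70, 0.28, 0.01, 0.01],
                   [0.60, 0.38, 0.01, 0.01]])
# Uniform ablation: every expert gets the same probability.
uniform = np.full((2, 4), 0.25)

kept_peaked = prune_experts(peaked)     # experts 0 and 1 survive
kept_uniform = prune_experts(uniform)   # no expert falls below the threshold
```

With peaked routing the rule discards the two rarely used experts. With uniform routing every expert looks equally useful, so nothing is pruned, and any parameter saving would have to come from elsewhere.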

Figures

Figures reproduced from arXiv: 2605.19301 by Chao Tao, Dongbo Zhou, Gaozhi Zhou, Haifeng Li, Ling Zhao, Wang Guo, Wentao Yang, Xian Li, Xuezhi Cui, Zeyuan Wang, Ziyu Li.

Figure 1: Comparison of alignment strategies in continual learning. (a) Task-specific LoRA: Assigns isolated LoRA modules to …
Figure 2: Training loss and mean KL divergence (averaged over …
Figure 3: (a) Overall architecture of iGSP; (b) Detailed IFER pipeline; (c) MoE structure of the plug-in module. $\{\varepsilon_j\}_{j=1}^{N_E}$. Given an input $x$, the router produces a routing distribution $\pi(x)$: $\pi(x) = [\pi_j(x)]_{j=1}^{N_E}$, $\sum_j \pi_j(x) = 1$ (1), where $\pi_j(x)$ denotes the activation probability of expert $\varepsilon_j$. The input $x$ is then forwarded to each expert network to obtain expert outputs $\varepsilon_j(x)$, where $j \in \{1, 2, \ldots, N_E\}$. The fi…
Figure 4: Geometric visualization of the iGSP optimization phases. (1) Subspace Pre-expansion: Initializing candidate basis …
Figure 5: iGSP’s two-stage training procedure with the number of pre-expanded experts set to 3.
Figure 7: Visualization of the L1 distances between average …
Figure 6: Number of Experts vs. Number of Learned Tasks under …
Figure 8: Visualization of subspace basis utilization in iGSP …
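
Equation (1), quoted alongside the Figure 3 caption above, only requires that the routing probabilities sum to 1. The sketch below assumes a softmax router over linear logits and a probability-weighted sum of expert outputs. Both are common Mixture-of-Experts choices but are assumptions here, since the source text is cut off before the output rule. The names `routing_distribution`, `moe_output`, `W_router`, and the linear experts are hypothetical.

```python
import numpy as np

def routing_distribution(x, router_weights):
    """pi(x): softmax over router logits, so the entries sum to 1 (Eq. 1)."""
    logits = router_weights @ x
    logits = logits - logits.max()           # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()

def moe_output(x, router_weights, experts):
    """Probability-weighted sum of expert outputs (assumed aggregation rule)."""
    pi = routing_distribution(x, router_weights)
    outputs = np.stack([expert(x) for expert in experts])  # (N_E, d_out)
    return pi @ outputs

rng = np.random.default_rng(0)
n_experts, dim = 3, 4
W_router = rng.normal(size=(n_experts, dim))
# Each hypothetical expert is a plain linear map.
expert_mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda v, M=M: M @ v for M in expert_mats]
x = rng.normal(size=dim)
pi = routing_distribution(x, W_router)
y = moe_output(x, W_router, experts)
```

Because `pi` is a proper distribution, its entries can be read directly as per-expert usage, which is what makes them usable as a pruning signal.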
Original abstract

Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes iGSP for efficient continual adaptation of vision-language models. It argues that alignment sharing is a geometric problem of overlapping low-rank subspaces and introduces a two-phase procedure: Subspace Identification (basis pre-expansion, subspace-constrained regularization, and pruning via MoE routing probabilities treated as gradient-flow proxies) followed by Orthogonal Subspace Fine-Tuning (fixed basis, regularization removed). On the MTIL benchmark the method reports state-of-the-art accuracy together with a 42.7% reduction in average trainable parameters and an 86.9% reduction in final total parameters relative to prior SOTA.

Significance. If the central geometric claim and the stability of the early-MoE-router subspace basis hold, the work supplies a principled route to parameter-efficient continual VLM adaptation that avoids both parameter explosion and negative transfer between visually similar but semantically distinct tasks. The explicit two-phase separation and the use of routing probabilities for pruning constitute a concrete, testable contribution to the PEFT/continual-learning literature.

major comments (2)
  1. [Section 3.2] Subspace Identification phase (Section 3.2): The central claim that early MoE router convergence produces a fixed, reusable low-rank subspace basis whose pruning (via routing probabilities as gradient-flow indicators) preserves task-specific information is load-bearing for both the 42.7% trainable-parameter reduction and the absence of negative transfer. The manuscript provides no ablation on router stability after the identification cutoff, no sensitivity analysis to the cutoff epoch, and no targeted evaluation on MTIL task pairs that are visually similar yet semantically distinct; without these, the reported efficiency gains cannot be confidently attributed to the proposed projection rather than to incomplete bases.
  2. [Section 3.3] Orthogonal Subspace Fine-Tuning phase (Section 3.3): The transition from subspace-constrained regularization to unconstrained fine-tuning assumes the identified basis already captures all reusable alignment; if this assumption fails on later tasks, the orthogonal residual fitting may still incur negative transfer. The current MTIL results do not report per-task forgetting curves or gradient alignment metrics before versus after the phase switch.
minor comments (2)
  1. [Section 3.1] The abstract and Section 3.1 refer to 'basis pre-expansion' without an explicit equation or pseudocode showing how the candidate experts are added to the historical subspace; a compact algorithmic box would improve reproducibility.
  2. [Figure 2] Figure 2 (or equivalent) comparing parameter counts should include error bars or multiple random seeds; the 86.9% total-parameter reduction is stated as a single figure without variance.
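
The two major comments ask for router-stability evidence and for gradient alignment metrics. One way such measurements could be computed is sketched below. The helper names, the probe-input setup, and the choice of mean KL divergence and cosine similarity are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over samples; p and q are (n_samples, n_experts)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Routing distributions on the same probe inputs at the cutoff and some epochs later.
at_cutoff = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
later = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])
drift = mean_kl(later, at_cutoff)          # positive: the router has moved
no_drift = mean_kl(at_cutoff, at_cutoff)   # zero: identical distributions

# Illustrative gradients before and after the phase switch.
aligned = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

A drift value that stays near zero after the cutoff would support treating the early router state as a fixed basis. A growing value would indicate the basis was frozen too early.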

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the recognition of the geometric framing of alignment sharing and the potential contribution of the two-phase iGSP procedure. Below we respond point by point to the major comments, indicating the revisions we will incorporate to address the concerns raised.

  1. Referee: [Section 3.2] Subspace Identification phase (Section 3.2): The central claim that early MoE router convergence produces a fixed, reusable low-rank subspace basis whose pruning (via routing probabilities as gradient-flow indicators) preserves task-specific information is load-bearing for both the 42.7% trainable-parameter reduction and the absence of negative transfer. The manuscript provides no ablation on router stability after the identification cutoff, no sensitivity analysis to the cutoff epoch, and no targeted evaluation on MTIL task pairs that are visually similar yet semantically distinct; without these, the reported efficiency gains cannot be confidently attributed to the proposed projection rather than to incomplete bases.

    Authors: We agree that these analyses would strengthen the attribution of the reported gains to the implicit projection mechanism. In the revised manuscript we will add (i) an ablation monitoring router stability by extending training 10 epochs past the identification cutoff and reporting changes in routing probability distributions and downstream accuracy; (ii) a sensitivity study varying the cutoff epoch from 3 to 15 and tabulating the resulting trade-offs in parameter count versus final MTIL accuracy; and (iii) a targeted subsection evaluating negative transfer on visually similar yet semantically distinct MTIL pairs (e.g., different animal classes or vehicle subtypes), using both accuracy deltas and gradient cosine similarity as metrics. These additions will allow readers to assess whether the efficiency improvements stem from the subspace projection rather than incomplete bases. The existing MTIL results already include a broad range of task similarities, but we will make the supporting evidence more explicit. revision: yes

  2. Referee: [Section 3.3] Orthogonal Subspace Fine-Tuning phase (Section 3.3): The transition from subspace-constrained regularization to unconstrained fine-tuning assumes the identified basis already captures all reusable alignment; if this assumption fails on later tasks, the orthogonal residual fitting may still incur negative transfer. The current MTIL results do not report per-task forgetting curves or gradient alignment metrics before versus after the phase switch.

    Authors: We acknowledge the value of explicit per-task metrics to verify that the phase switch does not re-introduce negative transfer. In the revised version we will include (i) per-task forgetting curves showing accuracy on each prior task after every new-task adaptation and (ii) gradient alignment statistics (cosine similarity between pre- and post-switch gradients projected onto the identified subspace). The design of iGSP ensures that phase-one regularization projects new gradients onto the historical basis, so that phase-two residual fitting operates in the orthogonal complement; the observed SOTA accuracy together with the 86.9% reduction in final total parameters on MTIL is consistent with limited interference. Nevertheless, we will add the requested curves and metrics to make this claim directly verifiable. revision: yes
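
The per-task forgetting curves mentioned in the second response can be derived from an accuracy matrix. The sketch below uses a common definition of forgetting (best earlier accuracy minus final accuracy); whether the authors would adopt this exact definition is an assumption, and the numbers are illustrative, not results from the paper.

```python
import numpy as np

def forgetting(acc):
    """Per-task forgetting after the final task.

    acc[i, j] is accuracy on task j measured after training on task i
    (only entries with j <= i are used).
    """
    n_tasks = acc.shape[0]
    result = np.zeros(n_tasks - 1)
    for j in range(n_tasks - 1):
        best_earlier = acc[j:n_tasks - 1, j].max()  # best accuracy before the last step
        result[j] = best_earlier - acc[n_tasks - 1, j]
    return result

# Illustrative numbers only, not results from the paper.
acc = np.array([[0.90, 0.00, 0.00],
                [0.85, 0.80, 0.00],
                [0.80, 0.78, 0.70]])
per_task = forgetting(acc)
```

Column `j` of `acc`, read from row `j` downward, is the forgetting curve for task `j`; `per_task` summarizes each curve as a single drop.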

Circularity Check

0 steps flagged

No significant circularity detected in iGSP derivation

full rationale

The paper's central derivation introduces a two-phase process (Subspace Identification with basis pre-expansion, subspace-constrained regularization, and routing-probability pruning, followed by Orthogonal Subspace Fine-Tuning) grounded in the geometric insight that alignment sharing is a problem of overlapping low-rank optimization trajectories. This is presented as a novel framework rather than a re-derivation of prior fitted quantities or self-citations. Experimental results on the MTIL benchmark are used to support the reported accuracy and parameter reductions; no equations or steps in the provided text reduce the claimed gains to inputs by construction, self-definition, or load-bearing self-citation chains. The method adds independent regularization and pruning mechanisms.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on a domain assumption about the geometry of task alignment rather than on additional fitted constants or new postulated entities. Because only the abstract is available, the ledger records the explicit high-level premise stated by the authors.

axioms (1)
  • domain assumption: Alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces.
    This premise is invoked to justify the entire subspace-projection approach and the use of MoE router convergence for basis identification.



