arxiv: 2606.23063 · v1 · pith:HSBXI6ISnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

Pith reviewed 2026-06-26 08:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords continual learning · multimodal LLMs · attention regularization · catastrophic forgetting · spectral analysis · vision-language models · replay-free learning · cross-attention maps

The pith

Attention-Spectrum Regularization preserves skill-level cross-attention spectra to reduce forgetting in continual multimodal LLMs without replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that skill-conditioned spectral properties of cross-modal attention maps can be summarized and preserved to control forgetting during continual adaptation of multimodal large language models. ASR extracts compact Fourier-based spectral statistics from attention maps, stores only skill-wise prototype distributions, and applies a phase-invariant regularizer that limits harmful drift while permitting task-specific adaptation. Theoretical analysis connects spectral drift to forgetting under a spectral sufficiency assumption and proves stability of the spectra to translations and perturbations. Experiments across VQA and instruction-tuning benchmarks demonstrate consistent gains over replay, regularization, and adapter baselines. This positions internal attention structure as a lightweight carrier of multimodal skills that does not require data replay or parameter isolation.

Core claim

ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions; a phase-invariant spectral regularizer then constrains harmful drift while allowing instance-level adaptation, with theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption.
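The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `spectral_stats`, the choice of radial bands for "scale" and orientation bands for "direction", and the bin counts are all assumptions, since this page does not give the paper's exact statistics.

```python
import numpy as np

def spectral_stats(attn_map, n_radial=4, n_angular=4):
    """Summarize a 2D attention map by band energies of its Fourier power spectrum.

    Hypothetical summary: radial bands stand in for "scale", orientation
    bands for "direction". The paper's actual statistics may differ.
    """
    a = np.asarray(attn_map, dtype=np.float64)
    h, w = a.shape
    # Power spectrum: squared magnitude only, so Fourier phase is discarded.
    power = np.abs(np.fft.fftshift(np.fft.fft2(a))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    angle = np.mod(np.arctan2(fy, fx), np.pi)  # orientation in [0, pi)
    mask = radius > 0                          # drop the DC component
    total = power[mask].sum() + 1e-12
    r_edges = np.linspace(0.0, radius.max(), n_radial + 1)
    r_idx = np.clip(np.digitize(radius, r_edges) - 1, 0, n_radial - 1)
    a_idx = np.minimum((angle / np.pi * n_angular).astype(int), n_angular - 1)
    radial = np.array([power[mask & (r_idx == k)].sum() for k in range(n_radial)])
    angular = np.array([power[mask & (a_idx == k)].sum() for k in range(n_angular)])
    # Each group is normalized by the total non-DC energy.
    return np.concatenate([radial, angular]) / total
```

The result is a short vector (here 8 numbers per map), which is what would make storing only summary distributions cheap compared with storing attention maps or data.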

What carries the argument

Skill-wise prototype distributions of Fourier power spectra extracted from cross-attention maps, which encode scale and directional properties and are regularized to prevent drift.
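One way to picture a skill-wise prototype and a drift penalty is sketched below. The names `SkillPrototype` and `drift_penalty`, the diagonal-Gaussian summary, and the variance-normalized distance are assumptions made for illustration; the paper's regularizer may take a different form, and in training it would be written in an autodiff framework rather than NumPy.

```python
import numpy as np

class SkillPrototype:
    """Running mean and variance of spectral statistics for one skill.

    Hypothetical container: only these moments are kept, never the
    attention maps or the inputs that produced them.
    """

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)

    def update(self, stats):
        # Welford's online update for mean and sum of squared deviations.
        stats = np.asarray(stats, dtype=np.float64)
        self.n += 1
        delta = stats - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (stats - self.mean)

    @property
    def var(self):
        return self.m2 / max(self.n - 1, 1)

def drift_penalty(batch_stats, proto, eps=1e-6):
    """Variance-normalized squared gap between the batch mean and the prototype mean."""
    batch_mean = np.mean(np.asarray(batch_stats, dtype=np.float64), axis=0)
    return float(np.mean((batch_mean - proto.mean) ** 2 / (proto.var + eps)))
```

Because the penalty in this sketch acts on the batch mean rather than on each sample, individual attention maps remain free to vary, which is one reading of "allowing instance-level adaptation".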

If this is right

  • Skill-conditioned spectral regularization reduces forgetting on continual VQA and multimodal instruction-tuning tasks.
  • The approach operates without storing past image-question pairs, pseudo-examples, or teacher snapshots.
  • Fourier power spectra of attention maps remain stable under spatial translations and bounded perturbations.
  • ASR improves final performance over replay-based, regularization-based, and adapter-based baselines on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT.
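The stability bullet can be checked numerically on a toy map. The sketch below relies on two standard facts, not on the paper's own proofs: a circular shift changes only Fourier phase, and with the unitary (`norm="ortho"`) transform the amplitude spectrum moves by at most the L2 norm of the perturbation. The map size, shift, and noise scale are arbitrary choices, and the paper may define translations differently from a circular shift.

```python
import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((12, 12))
attn /= attn.sum()  # toy attention map that sums to 1

def amplitude(x):
    """Magnitude of the unitary 2D DFT."""
    return np.abs(np.fft.fft2(x, norm="ortho"))

# Translation: a circular shift leaves the power spectrum unchanged.
shifted = np.roll(attn, shift=(3, -2), axis=(0, 1))
shift_gap = np.max(np.abs(amplitude(shifted) ** 2 - amplitude(attn) ** 2))

# Bounded perturbation: the amplitude gap is at most the noise norm
# (reverse triangle inequality plus Parseval for the unitary DFT).
noise = 1e-3 * rng.standard_normal(attn.shape)
amp_gap = np.linalg.norm(amplitude(attn + noise) - amplitude(attn))
noise_norm = np.linalg.norm(noise)

print(f"shift gap: {shift_gap:.2e}")
print(f"amplitude gap {amp_gap:.2e} <= noise norm {noise_norm:.2e}")
```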

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral-prototype mechanism could be tested in continual learning settings for text-only or vision-only transformers.
  • Combining attention-spectrum regularization with output-level distillation might yield additive gains on long task sequences.
  • Scaling the number of stored prototypes with model size or task count would test whether the storage cost remains sublinear.

Load-bearing premise

That controlling skill-conditioned spectral drift of attention prototypes is sufficient to control forgetting of multimodal skills.

What would settle it

An experiment in which skill-wise spectral drift remains small yet substantial forgetting still occurs on held-out tasks, or in which large spectral drift occurs with negligible forgetting.
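A rough way to run that check, given per-skill drift and forgetting measurements, is sketched below. The function names and both thresholds are hypothetical, and what counts as "small" drift or "substantial" forgetting would have to be fixed before looking at the results.

```python
import numpy as np

def drift_forgetting_correlation(drift, forgetting):
    """Pearson correlation between per-skill spectral drift and per-skill forgetting."""
    d = np.asarray(drift, dtype=np.float64)
    f = np.asarray(forgetting, dtype=np.float64)
    if d.shape != f.shape or d.ndim != 1 or d.size < 3:
        raise ValueError("need two 1-D arrays of equal length >= 3")
    return float(np.corrcoef(d, f)[0, 1])

def counterexamples(drift, forgetting, drift_tol, forget_tol):
    """Indices of skills that contradict the premise under the given thresholds.

    Returns (small drift but large forgetting, large drift but small forgetting).
    """
    d = np.asarray(drift, dtype=np.float64)
    f = np.asarray(forgetting, dtype=np.float64)
    small_drift_forgot = np.flatnonzero((d <= drift_tol) & (f > forget_tol))
    large_drift_kept = np.flatnonzero((d > drift_tol) & (f <= forget_tol))
    return small_drift_forgot.tolist(), large_drift_kept.tolist()
```

A non-empty first list is evidence against sufficiency, while a non-empty second list suggests that the regularizer may constrain more than it needs to.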

Original abstract

Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at https://github.com/Creative-zcx/attention-spectrum-replay

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Attention-Spectrum Regularization (ASR), a replay-free continual learning framework for multimodal LLMs. It treats cross-attention maps as 2D signals, extracts compact spectral statistics, stores skill-wise prototype distributions, and applies a phase-invariant spectral regularizer to limit harmful drift during adaptation to new visual domains or instructions. Theoretical analysis claims that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, with stability results for Fourier spectra under translations and perturbations. Experiments on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT benchmarks report improved final performance and reduced forgetting versus replay, regularization, and adapter baselines.

Significance. If the central claims hold, ASR provides a lightweight, data-free mechanism for preserving multimodal skills via attention spectra rather than replay buffers or task-specific parameters, which could scale better for non-stationary MLLM streams. Code release is a positive contribution for reproducibility. The approach is novel in its focus on spectral properties of cross-attention, but its impact hinges on validating the assumption that links spectra to forgetting.

major comments (3)
  1. [Abstract, theoretical analysis paragraph] The spectral sufficiency assumption (that controlling skill-conditioned spectral drift is sufficient to control forgetting) is invoked without derivation, empirical bounds, or ablation showing that other forgetting pathways (e.g., feed-forward weight changes or output-head drift) are negligible. This assumption is load-bearing for the claim that the phase-invariant regularizer guarantees reduced forgetting.
  2. [Section 5 (Experiments)] While improvements over baselines are claimed, the manuscript must report per-task forgetting metrics (e.g., average accuracy drop from peak to final) with standard deviations across runs and statistical tests; without these, the quantitative support for 'consistently reduces forgetting' cannot be assessed.
  3. [Method section, prototype storage and regularizer] The claim that storing only skill-wise spectral prototype distributions is strictly replay-free requires explicit confirmation that no pseudo-data, teacher logits, or old attention maps are retained at inference or regularization time; any implicit storage would undermine the replay-free positioning.
minor comments (2)
  1. [Method] Clarify notation for the spectral statistics (e.g., exact definition of scale and directional summaries) and the phase-invariance property in the method description.
  2. [Experiments] Add dataset sizes, number of continual stages/tasks, and training hyperparameters for each benchmark to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract, theoretical analysis paragraph] The spectral sufficiency assumption (that controlling skill-conditioned spectral drift is sufficient to control forgetting) is invoked without derivation, empirical bounds, or ablation showing that other forgetting pathways (e.g., feed-forward weight changes or output-head drift) are negligible. This assumption is load-bearing for the claim that the phase-invariant regularizer guarantees reduced forgetting.

    Authors: We agree that the spectral sufficiency assumption is central and is stated without a full derivation or supporting ablation in the current version. The theoretical analysis establishes stability of Fourier spectra under translations and perturbations but links this to forgetting control only under the assumption. In revision we will expand the theoretical section with a short derivation sketch based on the role of cross-attention in multimodal fusion and add an ablation that isolates spectral regularization from feed-forward and output-head changes, reporting the resulting forgetting metrics. revision: yes

  2. Referee: [Section 5 (Experiments)] While improvements over baselines are claimed, the manuscript must report per-task forgetting metrics (e.g., average accuracy drop from peak to final) with standard deviations across runs and statistical tests; without these, the quantitative support for 'consistently reduces forgetting' cannot be assessed.

    Authors: This is a valid request for stronger quantitative evidence. The current experiments report aggregate final performance and overall forgetting reduction but omit per-task peak-to-final drops, standard deviations, and statistical tests. We will revise Section 5 to include these metrics in additional tables, computed over multiple random seeds, together with paired statistical tests against the baselines. revision: yes

  3. Referee: [Method section, prototype storage and regularizer] The claim that storing only skill-wise spectral prototype distributions is strictly replay-free requires explicit confirmation that no pseudo-data, teacher logits, or old attention maps are retained at inference or regularization time; any implicit storage would undermine the replay-free positioning.

    Authors: The manuscript states that only skill-wise prototype distributions of spectral statistics are stored and that no past image-question pairs, pseudo-examples, or teacher snapshots are retained. During both regularization and inference the regularizer compares the current model's attention spectra solely against these stored prototypes; no data, logits, or prior maps are generated or accessed. We will add an explicit clarifying paragraph in the Method section confirming this point and reiterating that the approach remains strictly replay-free. revision: partial
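The per-task metrics promised in response 2 could be computed along these lines. This is a sketch under assumptions: the accuracy-matrix layout (`acc[s][j]` is the accuracy on task j after training stage s, with one stage per task), the peak-to-final definition of forgetting, and the helper names are illustrative and are not taken from the manuscript.

```python
from math import sqrt
from statistics import mean, stdev

def per_task_forgetting(acc):
    """Peak-to-final accuracy drop for every task except the last one.

    acc[s][j]: accuracy on task j after stage s (square matrix, tasks in
    training order). The peak is taken over the stages from when the task
    was learned up to, but not including, the final stage.
    """
    final = acc[-1]
    n_tasks = len(final)
    return [max(acc[s][j] for s in range(j, n_tasks - 1)) - final[j]
            for j in range(n_tasks - 1)]

def forgetting_over_seeds(runs):
    """Mean and sample standard deviation of average forgetting across seeds."""
    per_seed = [mean(per_task_forgetting(acc)) for acc in runs]
    return mean(per_seed), stdev(per_seed)  # needs at least two seeds

def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic over seeds; undefined if all differences are equal."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

The statistic would then be compared against a t distribution with (number of seeds minus 1) degrees of freedom. With only a handful of seeds, a nonparametric paired test may be a safer choice.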

Circularity Check

0 steps flagged

No circularity; regularization target and theoretical claim are independently defined

Full rationale

The paper defines ASR directly from cross-attention maps treated as 2D signals, extracts spectral statistics, stores skill-wise prototypes, and applies a phase-invariant regularizer. The theoretical claim links spectral drift to forgetting only under an explicitly stated spectral sufficiency assumption that is posited rather than derived from the method itself or from self-citations. No equations, fitted parameters, or predictions reduce to the inputs by construction, and no load-bearing self-citation chain is present in the provided text. The experimental results on external benchmarks are therefore not forced by the method's definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the spectral sufficiency assumption stated in the theoretical analysis section of the abstract; no free parameters or invented entities are named in the provided text.

axioms (1)
  • domain assumption · spectral sufficiency assumption: skill-conditioned spectral drift controls forgetting
    Invoked to link the regularizer to reduced forgetting (abstract, theoretical analysis sentence).

pith-pipeline@v0.9.1-grok · 5843 in / 1286 out tokens · 20111 ms · 2026-06-26T08:54:41.522904+00:00 · methodology

