Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

Canran Xiao; Chuangxin Zhao; Guiguang Ding; Jun Xia; Mengyao Lyu; Siyuan Ma; Yanbiao Ma; Yang Liu

arxiv: 2606.23063 · v1 · pith:HSBXI6ISnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

Chuangxin Zhao, Canran Xiao, Siyuan Ma, Mengyao Lyu, Yanbiao Ma, Jun Xia, Guiguang Ding, Yang Liu

Pith reviewed 2026-06-26 08:54 UTC · model grok-4.3

keywords continual learning · multimodal LLMs · attention regularization · catastrophic forgetting · spectral analysis · vision-language models · replay-free learning · cross-attention maps

The pith

Attention-Spectrum Regularization preserves skill-level cross-attention spectra to reduce forgetting in continual multimodal LLMs without replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that skill-conditioned spectral properties of cross-modal attention maps can be summarized and preserved to control forgetting during continual adaptation of multimodal large language models. ASR extracts compact Fourier-based spectral statistics from attention maps, stores only skill-wise prototype distributions, and applies a phase-invariant regularizer that limits harmful drift while permitting task-specific adaptation. Theoretical analysis connects spectral drift to forgetting under a spectral sufficiency assumption and proves stability of the spectra to translations and perturbations. Experiments across VQA and instruction-tuning benchmarks demonstrate consistent gains over replay, regularization, and adapter baselines. This positions internal attention structure as a lightweight carrier of multimodal skills that does not require data replay or parameter isolation.

Core claim

ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions; a phase-invariant spectral regularizer then constrains harmful drift while allowing instance-level adaptation, with theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption.

What carries the argument

Skill-wise prototype distributions of Fourier power spectra extracted from cross-attention maps, which encode scale and directional properties and are regularized to prevent drift.
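One way to make "scale and directional properties" concrete is to bin the Fourier power spectrum of an attention map by radial frequency and by orientation. The following is a minimal sketch under that assumption; the function name `spectral_stats`, the bin counts, and the normalization are illustrative choices and are not taken from the paper.

```python
import numpy as np

def spectral_stats(attn, n_radial=4, n_angular=4):
    """Radial (scale) and angular (direction) power fractions of a 2D map.

    Hypothetical summary for illustration; the paper's statistics may differ.
    """
    attn = np.asarray(attn, dtype=np.float64)
    h, w = attn.shape
    # The power spectrum keeps squared magnitudes and discards Fourier phase.
    power = np.abs(np.fft.fft2(attn)) ** 2
    power[0, 0] = 0.0  # drop the DC term, which only encodes total mass
    fy = np.fft.fftfreq(h)[:, None]  # frequency along rows
    fx = np.fft.fftfreq(w)[None, :]  # frequency along columns
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Fold orientation into [0, pi): a real signal has a symmetric spectrum.
    angle = np.mod(np.arctan2(fy, fx), np.pi)
    r_edges = np.linspace(0.0, radius.max() + 1e-12, n_radial + 1)
    a_edges = np.linspace(0.0, np.pi, n_angular + 1)
    r_idx = np.clip(np.digitize(radius, r_edges) - 1, 0, n_radial - 1)
    a_idx = np.clip(np.digitize(angle, a_edges) - 1, 0, n_angular - 1)
    total = power.sum() + 1e-12
    radial = np.bincount(r_idx.ravel(), weights=power.ravel(),
                         minlength=n_radial) / total
    angular = np.bincount(a_idx.ravel(), weights=power.ravel(),
                          minlength=n_angular) / total
    return np.concatenate([radial, angular])

# Usage on a random stand-in for one head's attention over an 8x8 patch grid.
stats = spectral_stats(np.random.default_rng(0).random((8, 8)))
print(stats.shape)
```

Each map collapses to a short vector, so a per-skill prototype can be built from these vectors rather than from stored maps or inputs.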

If this is right

Skill-conditioned spectral regularization reduces forgetting on continual VQA and multimodal instruction-tuning tasks.
The approach operates without storing past image-question pairs, pseudo-examples, or teacher snapshots.
Fourier power spectra of attention maps remain stable under spatial translations and bounded perturbations.
ASR improves final performance over replay-based, regularization-based, and adapter-based baselines on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT.
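The stability claim in the list above can be checked numerically on a toy map. This sketch assumes circular shifts (for which the invariance is exact) and uses the standard amplitude-spectrum bound from the reverse triangle inequality and Parseval's identity; it is not the paper's theorem, and the map size, shift, and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
attn = rng.random((16, 16))
attn /= attn.sum()  # normalize like attention over a 16x16 patch grid

def amplitude(a):
    # Magnitude of the orthonormal 2D DFT; squaring it gives the power spectrum.
    return np.abs(np.fft.fft2(a, norm="ortho"))

# A circular translation changes only the Fourier phase.
shifted = np.roll(attn, shift=(3, -5), axis=(0, 1))
shift_gap = np.max(np.abs(amplitude(attn) ** 2 - amplitude(shifted) ** 2))

# Bounded perturbation e: || |F(a + e)| - |F(a)| ||_2 <= ||e||_2
# for the orthonormal DFT.
eps = 1e-3 * rng.standard_normal(attn.shape)
amp_gap = np.linalg.norm(amplitude(attn + eps) - amplitude(attn))
eps_norm = np.linalg.norm(eps)

print(f"max power change under shift: {shift_gap:.2e}")
print(f"amplitude change {amp_gap:.2e} vs perturbation norm {eps_norm:.2e}")
```

Non-circular shifts, where content leaves the grid, are only approximately invariant, which is one reason a bound for perturbations is needed alongside the translation result.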

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same spectral-prototype mechanism could be tested in continual learning settings for text-only or vision-only transformers.
Combining attention-spectrum regularization with output-level distillation might yield additive gains on long task sequences.
Scaling the number of stored prototypes with model size or task count would test whether the storage cost remains sublinear.

Load-bearing premise

That controlling skill-conditioned spectral drift of attention prototypes is sufficient to control forgetting of multimodal skills.

What would settle it

An experiment in which skill-wise spectral drift remains small yet substantial forgetting still occurs on held-out tasks, or in which large spectral drift occurs with negligible forgetting.

Original abstract

Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at https://github.com/Creative-zcx/attention-spectrum-replay

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ASR introduces spectral regularization of cross-attention maps as a replay-free way to limit forgetting in continual MLLMs, but the approach rests on an unverified assumption that spectral control is sufficient.

The paper's main move is to treat cross-attention maps as 2-D signals, compute their Fourier power spectra, and store compact skill-wise prototype distributions. During new-task training a phase-invariant regularizer then limits drift in those spectra.
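A regularizer of this kind might look like the sketch below. It assumes each prototype is stored as a per-dimension mean and variance of spectral statistics, and that the penalty acts on the batch mean so that individual maps stay free to vary. The name `spectral_drift_penalty`, the variance normalization, and the hinge `margin` are all hypothetical; the paper's actual loss is not given in the text above.

```python
import numpy as np

def spectral_drift_penalty(batch_stats, proto_mean, proto_var,
                           margin=1.0, eps=1e-6):
    """Hinged, variance-normalized drift of batch-mean statistics.

    Illustrative only: the inputs are assumed to be power-spectrum summaries,
    so the penalty never sees Fourier phase.
    """
    batch_stats = np.asarray(batch_stats, dtype=np.float64)
    # Compare distribution-level statistics, not individual attention maps.
    batch_mean = batch_stats.mean(axis=0)
    z2 = (batch_mean - proto_mean) ** 2 / (proto_var + eps)
    drift = float(z2.mean())
    # Drift inside the margin is tolerated; only the excess is penalized.
    return max(0.0, drift - margin)

# Toy usage with made-up numbers.
proto_mean = np.array([0.5, 0.3, 0.2])
proto_var = np.array([0.01, 0.01, 0.01])
batch = proto_mean + 0.3 * np.array([[1.0, -1.0, 0.0], [1.0, -1.0, 0.0]])
print(spectral_drift_penalty(batch, proto_mean, proto_var))
```

In training this would be written in an autodiff framework and added to the task loss with a weight; NumPy is used here only to keep the sketch self-contained.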

This framing is new relative to the replay, output-preservation, and adapter baselines the abstract cites. It is also practically attractive because nothing from past data needs to be kept or regenerated.

The method does a reasonable job on the engineering side: only spectral statistics are stored, and the Fourier basis gives built-in invariance to translations and small perturbations. That part looks clean.

The soft spot is the spectral sufficiency assumption. The theoretical analysis claims that controlling skill-conditioned spectral drift controls forgetting, yet the abstract gives no indication whether this is derived, bounded empirically, or simply posited. If forgetting also occurs through feed-forward weights, output heads, or non-attention alignments, stabilizing the spectra alone may not be enough.

Experiments are reported on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, with claims of reduced forgetting versus strong baselines. Without the actual numbers, ablations, or dataset sizes it is hard to judge how large or robust the gains are.

The work is aimed at people already working on continual multimodal models who need replay-free options. A reader looking for fresh regularization targets could extract an idea or two.

It deserves peer review so the assumption and the experimental controls can be checked directly.

Referee Report

3 major / 2 minor

Summary. The paper proposes Attention-Spectrum Regularization (ASR), a replay-free continual learning framework for multimodal LLMs. It treats cross-attention maps as 2D signals, extracts compact spectral statistics, stores skill-wise prototype distributions, and applies a phase-invariant spectral regularizer to limit harmful drift during adaptation to new visual domains or instructions. Theoretical analysis claims that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, with stability results for Fourier spectra under translations and perturbations. Experiments on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT benchmarks report improved final performance and reduced forgetting versus replay, regularization, and adapter baselines.

Significance. If the central claims hold, ASR provides a lightweight, data-free mechanism for preserving multimodal skills via attention spectra rather than replay buffers or task-specific parameters, which could scale better for non-stationary MLLM streams. Code release is a positive contribution for reproducibility. The approach is novel in its focus on spectral properties of cross-attention but its impact hinges on validation of the linking assumption between spectra and forgetting.

major comments (3)

[Abstract, theoretical analysis paragraph] The spectral sufficiency assumption (that controlling skill-conditioned spectral drift is sufficient to control forgetting) is invoked without derivation, empirical bounds, or ablation showing that other forgetting pathways (e.g., feed-forward weight changes or output-head drift) are negligible. This assumption is load-bearing for the claim that the phase-invariant regularizer guarantees reduced forgetting.
[Section 5 (Experiments)] While improvements over baselines are claimed, the manuscript must report per-task forgetting metrics (e.g., average accuracy drop from peak to final) with standard deviations across runs and statistical tests; without these, the quantitative support for 'consistently reduces forgetting' cannot be assessed.
[Method section, prototype storage and regularizer] The claim that storing only skill-wise spectral prototype distributions is strictly replay-free requires explicit confirmation that no pseudo-data, teacher logits, or old attention maps are retained at inference or regularization time; any implicit storage would undermine the replay-free positioning.
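The per-task forgetting metric requested in the second major comment is commonly computed as the gap between a task's best accuracy at any earlier stage and its accuracy after the final stage. The sketch below follows that common definition; the accuracy values are made up for illustration and are not results from the paper.

```python
from statistics import mean, stdev

def forgetting_per_task(acc):
    """acc[t][k] is accuracy on task k after training stage t (k <= t).

    Returns peak-to-final drops for every task except the last one,
    which has no later stage in which it could be forgotten.
    """
    n_stages = len(acc)
    final = acc[-1]
    drops = []
    for k in range(n_stages - 1):
        peak = max(acc[t][k] for t in range(k, n_stages - 1))
        drops.append(peak - final[k])
    return drops

# Two hypothetical seeds of a three-stage run (illustrative numbers only).
runs = [
    [[80.0], [75.0, 70.0], [72.0, 66.0, 90.0]],
    [[82.0], [78.0, 71.0], [73.0, 68.0, 89.0]],
]
avg_forgetting = [mean(forgetting_per_task(r)) for r in runs]
print(f"forgetting: {mean(avg_forgetting):.2f} +/- {stdev(avg_forgetting):.2f}")
```

Reporting the per-seed list alongside the mean and standard deviation also makes paired tests against a baseline straightforward.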

minor comments (2)

[Method] Clarify notation for the spectral statistics (e.g., exact definition of scale and directional summaries) and the phase-invariance property in the method description.
[Experiments] Add dataset sizes, number of continual stages/tasks, and training hyperparameters for each benchmark to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and indicate planned revisions where appropriate.

Referee: [Abstract, theoretical analysis paragraph] The spectral sufficiency assumption (that controlling skill-conditioned spectral drift is sufficient to control forgetting) is invoked without derivation, empirical bounds, or ablation showing that other forgetting pathways (e.g., feed-forward weight changes or output-head drift) are negligible. This assumption is load-bearing for the claim that the phase-invariant regularizer guarantees reduced forgetting.

Authors: We agree that the spectral sufficiency assumption is central and is stated without a full derivation or supporting ablation in the current version. The theoretical analysis establishes stability of Fourier spectra under translations and perturbations but links this to forgetting control only under the assumption. In revision we will expand the theoretical section with a short derivation sketch based on the role of cross-attention in multimodal fusion and add an ablation that isolates spectral regularization from feed-forward and output-head changes, reporting the resulting forgetting metrics.

revision: yes

Referee: [Section 5 (Experiments)] While improvements over baselines are claimed, the manuscript must report per-task forgetting metrics (e.g., average accuracy drop from peak to final) with standard deviations across runs and statistical tests; without these, the quantitative support for 'consistently reduces forgetting' cannot be assessed.

Authors: This is a valid request for stronger quantitative evidence. The current experiments report aggregate final performance and overall forgetting reduction but omit per-task peak-to-final drops, standard deviations, and statistical tests. We will revise Section 5 to include these metrics in additional tables, computed over multiple random seeds, together with paired statistical tests against the baselines.

revision: yes

Referee: [Method section, prototype storage and regularizer] The claim that storing only skill-wise spectral prototype distributions is strictly replay-free requires explicit confirmation that no pseudo-data, teacher logits, or old attention maps are retained at inference or regularization time; any implicit storage would undermine the replay-free positioning.

Authors: The manuscript states that only skill-wise prototype distributions of spectral statistics are stored and that no past image-question pairs, pseudo-examples, or teacher snapshots are retained. During both regularization and inference the regularizer compares the current model's attention spectra solely against these stored prototypes; no data, logits, or prior maps are generated or accessed. We will add an explicit clarifying paragraph in the Method section confirming this point and reiterating that the approach remains strictly replay-free.

revision: partial
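What "storing only prototype distributions" could mean in practice is sketched below with a running mean and variance per skill, updated online with Welford's algorithm so that no individual attention map or input has to be kept. The class name `SkillPrototype` and the choice of a diagonal Gaussian summary are assumptions for illustration, not the paper's stated data structure.

```python
from dataclasses import dataclass, field

@dataclass
class SkillPrototype:
    """Running mean/variance of spectral statistics for one skill.

    Hypothetical structure: memory is O(dim) regardless of how many
    examples were seen, and no example is retained after update().
    """
    dim: int
    count: int = 0
    mean: list = field(default_factory=list)
    m2: list = field(default_factory=list)  # sum of squared deviations

    def __post_init__(self):
        self.mean = [0.0] * self.dim
        self.m2 = [0.0] * self.dim

    def update(self, stats):
        # Welford's online update; `stats` can be discarded afterwards.
        self.count += 1
        for i, x in enumerate(stats):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        if self.count < 2:
            return [0.0] * self.dim
        return [m / (self.count - 1) for m in self.m2]

# Usage: one prototype per skill, keyed by a skill label.
store = {"counting": SkillPrototype(dim=2)}
for stats in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    store["counting"].update(stats)
print(store["counting"].mean, store["counting"].variance())
```

Under this design the stored state per skill is a count plus two vectors, which is the kind of explicit accounting the referee's third comment asks for.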

Circularity Check

0 steps flagged

No circularity; regularization target and theoretical claim are independently defined

The paper defines ASR directly from cross-attention maps treated as 2D signals, extracts spectral statistics, stores skill-wise prototypes, and applies a phase-invariant regularizer. The theoretical claim links spectral drift to forgetting only under an explicitly stated spectral sufficiency assumption that is posited rather than derived from the method itself or from self-citations. No equations, fitted parameters, or predictions reduce to the inputs by construction, and no load-bearing self-citation chain is present in the provided text. The experimental results on external benchmarks are therefore not forced by the method's definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the spectral sufficiency assumption stated in the theoretical analysis section of the abstract; no free parameters or invented entities are named in the provided text.

axioms (1)

domain assumption · spectral sufficiency assumption: skill-conditioned spectral drift controls forgetting
Invoked to link the regularizer to reduced forgetting (abstract, theoretical analysis sentence).

Reference graph

Works this paper leans on

54 extracted references · 2 linked inside Pith

[1]

Visual instruction tuning.Advances in neural information processing systems, 36:34892– 34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892– 34916, 2023

2023
[2]

Lvlm-ehub: A comprehen- sive evaluation benchmark for large vision-language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1877–1893, 2024

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehen- sive evaluation benchmark for large vision-language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1877–1893, 2024

2024
[3]

Improving representation of high- frequency components for medical visual foundation models.IEEE Transactions on Medical Imaging, 2025

Yuetan Chu, Yilan Zhang, Zhongyi Han, Changchun Yang, Longxi Zhou, Gongning Luo, Chao Huang, and Xin Gao. Improving representation of high- frequency components for medical visual foundation models.IEEE Transactions on Medical Imaging, 2025

2025
[4]

Construct-vl: Data-free continual structured vl concepts learning

James Seale Smith, Paola Cascante-Bonilla, Assaf Ar- belle, Donghyun Kim, Rameswar Panda, David Cox, Diyi Yang, Zsolt Kira, Rogerio Feris, and Leonid Kar- linsky. Construct-vl: Data-free continual structured vl concepts learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 14994–15004, 2023

2023
[5]

Coin: A benchmark of continual instruction tuning for multimodel large language models.Advances in Neural Information Processing Systems, 37:57817–57840, 2024

Cheng Chen, Junchen Zhu, Xu Luo, Heng T Shen, Jingkuan Song, and Lianli Gao. Coin: A benchmark of continual instruction tuning for multimodel large language models.Advances in Neural Information Processing Systems, 37:57817–57840, 2024

2024
[6]

Ct-based ai system for quantitative and integrated manage- ment of acute respiratory distress syndrome in criti- cal care.npj Digital Medicine, 2026

Yuetan Chu, Jianpeng Wang, Peiyao Luo, Hui Chen, Zhongheng Zhang, Jiannan Zhang, Yilan Zhang, Yingnan Ju, Yaxin Xiong, Xiqing Luo, et al. Ct-based ai system for quantitative and integrated manage- ment of acute respiratory distress syndrome in criti- cal care.npj Digital Medicine, 2026

2026
[7]

Impromptu vla: Open weights and open data for driving vision-language- action models.Advances in Neural Information Pro- cessing Systems, 38, 2026

Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language- action models.Advances in Neural Information Pro- cessing Systems, 38, 2026

2026
[8]

Psycholinguistics meets con- tinuallearning: Measuringcatastrophicforgettingin visual question answering

Claudio Greco, Barbara Plank, Raquel Fernández, and Raffaella Bernardi. Psycholinguistics meets con- tinuallearning: Measuringcatastrophicforgettingin visual question answering. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3601–3605, 2019

2019
[9]

Continual vision-language representation learning with off-diagonal informa- tion

Zixuan Ni, Longhui Wei, Siliang Tang, Yueting Zhuang, and Qi Tian. Continual vision-language representation learning with off-diagonal informa- tion. InInternational Conference on Machine Learn- ing, pages 26129–26149. PMLR, 2023

2023
[10]

Preventing zero-shot transfer degradation in continual learn- ing of vision-language models

Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learn- ing of vision-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 19125–19136, 2023

2023
[11]

Medtri: Aplatformforstructuredmedi- cal report normalization to enhance vision-language pretraining.arXiv preprint arXiv:2602.22143, 2026

Yuetan Chu, Xinhua Ma, Xinran Jin, Gongning Luo, andXinGao. Medtri: Aplatformforstructuredmedi- cal report normalization to enhance vision-language pretraining.arXiv preprint arXiv:2602.22143, 2026

arXiv 2026
[12]

Semantic perceptual image com- pression with a laplacian pyramid of convolutional networks.IEEE Transactions on Image Processing, 30:4225–4237, 2021

Juan Wang, Yiping Duan, Xiaoming Tao, Mai Xu, and Jianhua Lu. Semantic perceptual image com- pression with a laplacian pyramid of convolutional networks.IEEE Transactions on Image Processing, 30:4225–4237, 2021

2021
[13]

Symbolic replay: Scene graph as prompt for continual learning on vqa task

Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yux- uanWang, WeiLiu, MengmiZhang, andMikeZheng Shou. Symbolic replay: Scene graph as prompt for continual learning on vqa task. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1250–1259, 2023

2023
[14]

Vqacl: Anovelvisualquestionansweringcontinuallearning setting

Xi Zhang, Feifei Zhang, and Changsheng Xu. Vqacl: Anovelvisualquestionansweringcontinuallearning setting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19102–19112, 2023

2023
[15]

Ask and re- member: A questions-only replay strategy for con- tinual visual question answering.arXiv preprint arXiv:2502.04469, 2025

Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière, and Joost van de Weijer. Ask and re- member: A questions-only replay strategy for con- tinual visual question answering.arXiv preprint arXiv:2502.04469, 2025. URL https://arxiv. org/pdf/2502.04469

arXiv 2025
[16]

Cl-moe: Enhancing multimodal large language model with dual momentum mixture-of-experts for continual visual question answering

Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, and Liang He. Cl-moe: Enhancing multimodal large language model with dual momentum mixture-of-experts for continual visual question answering. InProceedings of the 16 Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs Computer Vision and Pattern Recognition Confe...

2025
[17]

Generative negative text replay for continual vision-language pretraining

Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, and Xuming He. Generative negative text replay for continual vision-language pretraining. InEuropean Conference on Computer Vision, pages 22–38. Springer, 2022

2022
[18]

Ctp: Towards vision- language continual pretraining via compatible mo- mentum contrast and topology preservation

Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, and Yao Zhao. Ctp: Towards vision- language continual pretraining via compatible mo- mentum contrast and topology preservation. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 22257–22267, 2023

2023
[19]

C-clip: Multimodal continual learning for vision- language model

Wenzhuo Liu, Fei Zhu, Longhui Wei, and Qi Tian. C-clip: Multimodal continual learning for vision- language model. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[20]

Hide- llava: Hierarchical decoupling for continual instruc- tion tuning of multimodal large language model

Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da- Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Hide- llava: Hierarchical decoupling for continual instruc- tion tuning of multimodal large language model. In Proceedings of the 63rd Annual Meeting of the Associ- ation for Computational Linguistics (Volume 1: Long Papers), pages 13572–13586, 2025

2025
[21]

Sefe: Superficial and essential forgetting eliminator for multimodal continual instruction tuning

Jinpeng Chen, Runmin Cong, Yuzhi Zhao, Hongzheng Yang, Guangneng Hu, Horace Ip, and Sam Kwong. Sefe: Superficial and essential forgetting eliminator for multimodal continual instruction tuning. InForty-second International Conference on Machine Learning, 2025

2025
[22]

Enhancing multimodal continual in- struction tuning with branchlora.arXiv preprint arXiv:2506.02041, 2025

Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, and Jinfeng Bai. Enhancing multimodal continual in- struction tuning with branchlora.arXiv preprint arXiv:2506.02041, 2025

arXiv 2025
[23]

Dynamic mixture of curriculum lora experts for continual multimodal instruction tuning

Chendi Ge, Xin Wang, Zeyang Zhang, Hong Chen, Jiapei Fan, Longtao Huang, Hui Xue, and Wenwu Zhu. Dynamic mixture of curriculum lora experts for continual multimodal instruction tuning. InForty- second International Conference on Machine Learning, 2025

2025
[24]

Adapt- ∞: Scalable continual multimodal instruction tuning via dynamic data se- lection

Adyasha Maharana, Jaehong Yoon, Tianlong Chen, and Mohit Bansal. Adapt- ∞: Scalable continual multimodal instruction tuning via dynamic data se- lection. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[25]

Stacked attention networks for image question answering

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. InProceedings of the IEEE conferenceoncomputervisionandpatternrecognition, pages 21–29, 2016

2016
[26]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018

2018
[27]

Vilbert: Pretraining task-agnostic visiolinguistic rep- resentations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic rep- resentations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

2019
[28]

Lxmert: Learning cross-modality encoder representations from trans- formers

Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from trans- formers. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 5100– 5111, 2019

2019
[29]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Ab- hishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. InProceedings of the IEEE international conference on computer vision, pages 618–626, 2017

2017
[30]

Re- versible primitive–composition alignment for con- tinual vision–language learning

Canran Xiao, Tianxiang Xu, Siyuan Ma, Yiyang Jiang, Haoyu Gao, Yuhan Wu, et al. Re- versible primitive–composition alignment for con- tinual vision–language learning. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[31]

Jinlai Zhang, Mingchao Xiang, Yongheng Hu, Wei Hao, Linlong Lei, and Kefu Yi. Multivariate fea- ture learning and associative spatial information en- hancement for snow object detection in autonomous driving.Engineering Applications of Artificial Intelli- gence, 175:114672, 2026

2026
[32]

Large-kernel spatially parallel feature fusion for monocular 3d perception in autonomous driving

Ruanzhi Jiao, Jinlai Zhang, Chang Li, and Lin Hu. Large-kernel spatially parallel feature fusion for monocular 3d perception in autonomous driving. Knowledge-Based Systems, 343:115998, 2026

2026
[33]

Making the v in vqa matter: Elevating the role of image understand- ing in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understand- ing in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

2017
[34]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

JamesKirkpatrick, RazvanPascanu, NeilRabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, KieranMilan,JohnQuan,TiagoRamalho,Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017
[35]

Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017. 17 Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

2017
[36]

Continual learning with tiny episodic memories

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elho- seiny, Thalaiyasingam Ajanthan, P Dokania, P Torr, and M Ranzato. Continual learning with tiny episodic memories. InWorkshop on Multi-Task and Lifelong Reinforcement Learning, 2019

2019
[37]

Dark experi- ence for general continual learning: a strong, simple baseline.Advances in neural information processing systems, 33:15920–15930, 2020

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experi- ence for general continual learning: a strong, simple baseline.Advances in neural information processing systems, 33:15920–15930, 2020

2020
[38]

Tic-clip: Con- tinual training of clip models

Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, and Fartash Faghri. Tic-clip: Con- tinual training of clip models. InThe Twelfth In- ternational Conference on Learning Representations, 2024

2024
[39]

Decouple before in- teract: Multi-modal prompt learning for continual visual question answering

Zi Qian, Xin Wang, Xuguang Duan, Pengda Qin, Yuhong Li, and Wenwu Zhu. Decouple before in- teract: Multi-modal prompt learning for continual visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 2953–2962, 2023

2023
[40]

Overcoming dual drift for continual long- tailed visual question answering

Feifei Zhang, Zhihao Wang, Xi Zhang, and Chang- sheng Xu. Overcoming dual drift for continual long- tailed visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4413–4423, 2025

2025
[41]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations, 2022

2022
[42]

Orthogonal subspace learning for language model continual learning

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. InFindings of the Asso- ciation for Computational Linguistics: EMNLP 2023, pages 10658–10671, 2023

2023
[43]

Plop: Learning without forgetting for continual semantic segmentation

Arthur Douillard, Yifu Chen, Arnaud Dapogny, and Matthieu Cord. Plop: Learning without forgetting for continual semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4040–4050, 2021

2021
[44]

Paying more attention to attention: Improving the perfor- mance of convolutional neural networks via atten- tion transfer

Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the perfor- mance of convolutional neural networks via atten- tion transfer. InInternational Conference on Learning Representations, 2017

2017
[45]

Vibus: Data-efficient 3d scene parsing with view- point bottleneck and uncertainty-spectrum model- ing.ISPRS Journal of Photogrammetry and Remote Sensing, 194:302–318, 2022

Beiwen Tian, Liyi Luo, Hao Zhao, and Guyue Zhou. Vibus: Data-efficient 3d scene parsing with view- point bottleneck and uncertainty-spectrum model- ing.ISPRS Journal of Photogrammetry and Remote Sensing, 194:302–318, 2022

2022
[46]

Enhancing continual learning in visual question answering with modality-aware feature distillation

Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, and Alessandro Suglia. Enhancing continual learning in visual question answering with modality-aware feature distillation. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 73–85, 2024

2024
[47]

One vlm to keep it learning: Generation and balancing for data-free continual visual question answering

Deepayan Das, Davide Talon, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. One vlm to keep it learning: Generation and balancing for data-free continual visual question answering. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5635–5645. IEEE, 2025

2025
[48]

Adaptive dual cross-attention network for multispectral object detection in autonomous driving. Expert Systems with Applications, page 132012, 2026

Jinlai Zhang, Xiaolong Song, Yucheng Li, Diqing Liang, Zhiyong Zhang, and Jinhu Cai. Adaptive dual cross-attention network for multispectral object detection in autonomous driving. Expert Systems with Applications, page 132012, 2026

2026
[49]

Cerberus transformer: Joint semantic, affordance and attribute parsing

Xiaoxue Chen, Tianyu Liu, Hao Zhao, Guyue Zhou, and Ya-Qin Zhang. Cerberus transformer: Joint semantic, affordance and attribute parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19649–19658, 2022

2022
[50]

Keeplora: Continual learning with residual gradient adaptation

Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, and Min-Ling Zhang. Keeplora: Continual learning with residual gradient adaptation. In The Fourteenth International Conference on Learning Representations, 2026

2026
[51]

Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications

Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. CoRR, 2023

2023
[52]

Pclr: Progressively compressed lora for multimodal continual instruction tuning

Weicheng Meng, Jingyang Qiao, Shaohui Liu, Zhizhong Zhang, and Yuan Xie. Pclr: Progressively compressed lora for multimodal continual instruction tuning. In The Fourteenth International Conference on Learning Representations, 2026

2026
[53]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

arXiv 2025
[54]

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

arXiv 2025
