Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs
Pith reviewed 2026-06-26 08:54 UTC · model grok-4.3
The pith
Attention-Spectrum Regularization preserves skill-level cross-attention spectra to reduce forgetting in continual multimodal LLMs without replay.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions. A phase-invariant spectral regularizer then constrains harmful drift while allowing instance-level adaptation. The theoretical analysis shows that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption.
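As a rough illustration of what "compact spectral statistics" could look like, the sketch below bins the Fourier power of one attention map by radius (scale) and by orientation (direction). The function name `spectral_stats`, the bin counts, and the radial/angular binning itself are assumptions made for illustration; the summary above does not state the paper's exact statistics.

```python
import numpy as np

def spectral_stats(attn, n_radial=4, n_angular=4):
    """Summarize a 2D attention map by radial (scale) and angular (direction) power.

    Hypothetical helper: the binning scheme is an assumption, not the paper's definition.
    """
    attn = np.asarray(attn, dtype=np.float64)
    h, w = attn.shape
    # The power spectrum discards phase, so circular shifts of the map do not change it.
    power = np.abs(np.fft.fftshift(np.fft.fft2(attn))) ** 2
    # Frequency coordinates in cycles per sample, centred to match fftshift.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Orientation folded into [0, pi): the spectrum of a real signal is symmetric.
    theta = np.mod(np.arctan2(fy, fx), np.pi)
    r_edges = np.linspace(0.0, radius.max() + 1e-12, n_radial + 1)
    t_edges = np.linspace(0.0, np.pi, n_angular + 1)
    r_idx = np.clip(np.digitize(radius, r_edges) - 1, 0, n_radial - 1)
    t_idx = np.clip(np.digitize(theta, t_edges) - 1, 0, n_angular - 1)
    radial = np.bincount(r_idx.ravel(), weights=power.ravel(), minlength=n_radial)
    has_direction = radius.ravel() > 0  # the DC term has no orientation
    angular = np.bincount(
        t_idx.ravel()[has_direction],
        weights=power.ravel()[has_direction],
        minlength=n_angular,
    )
    stats = np.concatenate([radial, angular])
    return stats / (stats.sum() + 1e-12)  # normalize so maps of different mass compare
```

Calling `spectral_stats` on an `(H, W)` attention map returns a vector of length `n_radial + n_angular`, which is the kind of low-dimensional summary that could be stored instead of the map itself.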
What carries the argument
Skill-wise prototype distributions of Fourier power spectra extracted from cross-attention maps, which encode scale and directional properties and are regularized to prevent drift.
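One way to picture a skill-wise prototype distribution is a per-skill running mean and variance over spectral-statistic vectors, with a penalty on how far the current batch mean has moved. This is a minimal sketch under that assumption: the class `SkillPrototypes`, the diagonal-Gaussian form, and the variance-scaled squared distance are all hypothetical choices, not details confirmed by the summary.

```python
import numpy as np

class SkillPrototypes:
    """Per-skill running mean/variance of spectral statistic vectors (hypothetical store)."""

    def __init__(self):
        self.count, self.mean, self.m2 = {}, {}, {}

    def update(self, skill, stats):
        # Welford update: only summary moments are kept, never the inputs themselves.
        x = np.asarray(stats, dtype=np.float64)
        if skill not in self.count:
            self.count[skill] = 0
            self.mean[skill] = np.zeros_like(x)
            self.m2[skill] = np.zeros_like(x)
        self.count[skill] += 1
        delta = x - self.mean[skill]
        self.mean[skill] += delta / self.count[skill]
        self.m2[skill] += delta * (x - self.mean[skill])

    def variance(self, skill):
        # Sample variance; falls back to dividing by 1 when only one vector was seen.
        return self.m2[skill] / max(self.count[skill] - 1, 1)

    def drift_penalty(self, skill, batch_stats, eps=1e-6):
        # Compare the batch mean with the prototype mean; per-instance spread stays free.
        batch_mean = np.mean(np.asarray(batch_stats, dtype=np.float64), axis=0)
        diff = batch_mean - self.mean[skill]
        return float(np.sum(diff ** 2 / (self.variance(skill) + eps)))
```

Penalizing only the batch-level mean is one reading of "constrain prototype drift while allowing instance-level adaptation"; in a training loop the same quantity would be computed with a differentiable array library and added to the task loss with a weighting coefficient.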
If this is right
- Skill-conditioned spectral regularization reduces forgetting on continual VQA and multimodal instruction-tuning tasks.
- The approach operates without storing past image-question pairs, pseudo-examples, or teacher snapshots.
- Fourier power spectra of attention maps remain stable under spatial translations and bounded perturbations.
- ASR improves final performance over replay-based, regularization-based, and adapter-based baselines on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT.
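The stability claim in the list above rests on standard properties of the discrete Fourier transform. The derivation below is a sketch of those textbook facts for an $H \times W$ map $A$ with circular shifts and an unnormalized DFT; the notation is chosen here for illustration and is not necessarily the paper's own statement.

```latex
% DFT of an H x W attention map A
\hat{A}[u,v] = \sum_{m=0}^{H-1}\sum_{n=0}^{W-1} A[m,n]\, e^{-2\pi i \left(\frac{um}{H} + \frac{vn}{W}\right)}

% Circular shift by (a,b): A'[m,n] = A[(m-a) \bmod H,\ (n-b) \bmod W]
\hat{A}'[u,v] = e^{-2\pi i \left(\frac{ua}{H} + \frac{vb}{W}\right)} \hat{A}[u,v]
\quad\Longrightarrow\quad
\bigl|\hat{A}'[u,v]\bigr|^{2} = \bigl|\hat{A}[u,v]\bigr|^{2}

% Bounded perturbation E, using |\hat{E}[u,v]| \le \|E\|_1 and |\hat{A}[u,v]| \le \|A\|_1
\Bigl|\, \bigl|\hat{A}[u,v] + \hat{E}[u,v]\bigr|^{2} - \bigl|\hat{A}[u,v]\bigr|^{2} \Bigr|
\;\le\; \bigl|\hat{E}[u,v]\bigr| \bigl( 2\bigl|\hat{A}[u,v]\bigr| + \bigl|\hat{E}[u,v]\bigr| \bigr)
\;\le\; \|E\|_{1} \bigl( 2\|A\|_{1} + \|E\|_{1} \bigr)
```

The shift only multiplies each coefficient by a unit-modulus phase, which is why a regularizer built on power spectra is phase-invariant, and the last line shows that a small perturbation of the map changes each power coefficient by a correspondingly small amount.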
Where Pith is reading between the lines
- The same spectral-prototype mechanism could be tested in continual learning settings for text-only or vision-only transformers.
- Combining attention-spectrum regularization with output-level distillation might yield additive gains on long task sequences.
- Scaling the number of stored prototypes with model size or task count would test whether the storage cost remains sublinear.
Load-bearing premise
That controlling skill-conditioned spectral drift of attention prototypes is sufficient to control forgetting of multimodal skills.
What would settle it
An experiment in which skill-wise spectral drift remains small yet substantial forgetting still occurs on held-out tasks, or in which large spectral drift occurs with negligible forgetting.
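The test described above can be phrased as a search for skills whose measured drift and forgetting disagree. The sketch below assumes per-skill drift and forgetting values have already been measured; the function name `find_counterexamples` and both thresholds are hypothetical and would need to be chosen per benchmark.

```python
def find_counterexamples(drift, forgetting, drift_tol, forget_tol):
    """Return skills whose (drift, forgetting) pair contradicts the premise.

    drift, forgetting: dicts keyed by skill name (hypothetical inputs).
    drift_tol, forget_tol: thresholds separating "small" from "large" (assumed).
    """
    small_drift_big_forgetting = []
    big_drift_no_forgetting = []
    for skill in drift:
        d, f = drift[skill], forgetting[skill]
        if d <= drift_tol and f > forget_tol:
            # Spectra were preserved, yet the skill was still forgotten.
            small_drift_big_forgetting.append(skill)
        elif d > drift_tol and f <= forget_tol:
            # Spectra moved a lot, yet the skill survived.
            big_drift_no_forgetting.append(skill)
    return small_drift_big_forgetting, big_drift_no_forgetting
```

A non-empty first list would challenge the sufficiency of spectral control, while a non-empty second list would suggest the regularizer constrains more than it needs to.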
Original abstract
Multimodal large language models (MLLMs) are increasingly required to adapt to non-stationary streams of visual domains, question types, and user instructions, yet continual fine-tuning often causes severe forgetting of previously acquired multimodal skills. Existing continual vision-language methods mainly preserve outputs, replay data or pseudo-data, regularize embedding geometry, or allocate task-specific parameters, but they provide limited control over how internal cross-modal attention patterns supporting old skills drift during adaptation. We propose Attention-Spectrum Regularization (ASR), a replay-free continual learning framework that preserves skill-conditioned structures of cross-modal attention. ASR treats cross-attention maps as two-dimensional signals, summarizes their scale and directional properties into compact spectral statistics, and stores only skill-wise prototype distributions instead of replaying past image-question pairs, generated pseudo-examples, or old-stage teacher snapshots. In later stages, a phase-invariant spectral regularizer constrains harmful drift of these prototypes while allowing instance-level attention to adapt to new tasks. We provide theoretical analysis showing that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, and that Fourier power spectra are stable to spatial translations and bounded perturbations. Experiments on continual VQA and multimodal instruction-tuning benchmarks, including VQA v2, VQACL, CLT-VQA, CoIN, and UCIT, show that ASR consistently improves final performance and reduces forgetting over strong replay-, regularization-, and adapter-based baselines. Preserving skill-level attention structure is an effective and lightweight mechanism for continual MLLMs. Code is available at https://github.com/Creative-zcx/attention-spectrum-replay
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Attention-Spectrum Regularization (ASR), a replay-free continual learning framework for multimodal LLMs. It treats cross-attention maps as 2D signals, extracts compact spectral statistics, stores skill-wise prototype distributions, and applies a phase-invariant spectral regularizer to limit harmful drift during adaptation to new visual domains or instructions. Theoretical analysis claims that skill-conditioned spectral drift controls forgetting under a spectral sufficiency assumption, with stability results for Fourier spectra under translations and perturbations. Experiments on VQA v2, VQACL, CLT-VQA, CoIN, and UCIT benchmarks report improved final performance and reduced forgetting versus replay, regularization, and adapter baselines.
Significance. If the central claims hold, ASR provides a lightweight, data-free mechanism for preserving multimodal skills via attention spectra rather than replay buffers or task-specific parameters, which could scale better for non-stationary MLLM streams. The code release is a positive contribution to reproducibility. The approach is novel in its focus on spectral properties of cross-attention, but its impact hinges on validating the assumption that links spectra to forgetting.
major comments (3)
- [Abstract, theoretical analysis paragraph] The spectral sufficiency assumption (that controlling skill-conditioned spectral drift is sufficient to control forgetting) is invoked without derivation, empirical bounds, or ablation showing that other forgetting pathways (e.g., feed-forward weight changes or output-head drift) are negligible. This assumption is load-bearing for the claim that the phase-invariant regularizer guarantees reduced forgetting.
- [Section 5 (Experiments)] While improvements over baselines are claimed, the manuscript must report per-task forgetting metrics (e.g., average accuracy drop from peak to final) with standard deviations across runs and statistical tests; without these, the quantitative support for 'consistently reduces forgetting' cannot be assessed.
- [Method section (prototype storage and regularizer)] The claim that storing only skill-wise spectral prototype distributions is strictly replay-free requires explicit confirmation that no pseudo-data, teacher logits, or old attention maps are retained at inference or regularization time; any implicit storage would undermine the replay-free positioning.
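The per-task forgetting metric requested in the comments above can be computed from a stage-by-task accuracy table. This is a minimal sketch, assuming the common convention that forgetting for a task is its best accuracy at any earlier stage minus its accuracy after the final stage; the function names are hypothetical, and the paired statistic is given without a p-value, which would need a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, stdev

def average_forgetting(acc):
    """acc[t][j]: accuracy on task j after training stage t (defined for j <= t)."""
    last = len(acc) - 1
    drops = []
    for j in range(last):  # the final task has no later stage in which to be forgotten
        peak = max(acc[t][j] for t in range(j, last))
        drops.append(peak - acc[last][j])
    return mean(drops)

def summarize_over_seeds(per_seed_forgetting):
    """Mean and sample standard deviation of forgetting across random seeds."""
    return mean(per_seed_forgetting), stdev(per_seed_forgetting)

def paired_t_statistic(ours, baseline):
    """t statistic for per-seed paired differences (no p-value is computed here)."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

Reporting `summarize_over_seeds` per method and `paired_t_statistic` against each baseline, with seeds matched between the two lists, would supply the per-run variability and significance information the comment asks for.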
minor comments (2)
- [Method] Clarify notation for the spectral statistics (e.g., exact definition of scale and directional summaries) and the phase-invariance property in the method description.
- [Experiments] Add dataset sizes, number of continual stages/tasks, and training hyperparameters for each benchmark to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and indicate planned revisions where appropriate.
Point-by-point responses
- Referee: [Abstract, theoretical analysis paragraph] The spectral sufficiency assumption (that controlling skill-conditioned spectral drift is sufficient to control forgetting) is invoked without derivation, empirical bounds, or ablation showing that other forgetting pathways (e.g., feed-forward weight changes or output-head drift) are negligible. This assumption is load-bearing for the claim that the phase-invariant regularizer guarantees reduced forgetting.
Authors: We agree that the spectral sufficiency assumption is central and is stated without a full derivation or supporting ablation in the current version. The theoretical analysis establishes stability of Fourier spectra under translations and perturbations but links this to forgetting control only under the assumption. In revision we will expand the theoretical section with a short derivation sketch based on the role of cross-attention in multimodal fusion and add an ablation that isolates spectral regularization from feed-forward and output-head changes, reporting the resulting forgetting metrics.
Revision: yes
- Referee: [Section 5 (Experiments)] While improvements over baselines are claimed, the manuscript must report per-task forgetting metrics (e.g., average accuracy drop from peak to final) with standard deviations across runs and statistical tests; without these, the quantitative support for 'consistently reduces forgetting' cannot be assessed.
Authors: This is a valid request for stronger quantitative evidence. The current experiments report aggregate final performance and overall forgetting reduction but omit per-task peak-to-final drops, standard deviations, and statistical tests. We will revise Section 5 to include these metrics in additional tables, computed over multiple random seeds, together with paired statistical tests against the baselines.
Revision: yes
- Referee: [Method section (prototype storage and regularizer)] The claim that storing only skill-wise spectral prototype distributions is strictly replay-free requires explicit confirmation that no pseudo-data, teacher logits, or old attention maps are retained at inference or regularization time; any implicit storage would undermine the replay-free positioning.
Authors: The manuscript states that only skill-wise prototype distributions of spectral statistics are stored and that no past image-question pairs, pseudo-examples, or teacher snapshots are retained. During both regularization and inference the regularizer compares the current model's attention spectra solely against these stored prototypes; no data, logits, or prior maps are generated or accessed. We will add an explicit clarifying paragraph in the Method section confirming this point and reiterating that the approach remains strictly replay-free.
Revision: partial
Circularity Check
No circularity; regularization target and theoretical claim are independently defined
Full rationale
The paper defines ASR directly from cross-attention maps treated as 2D signals, extracts spectral statistics, stores skill-wise prototypes, and applies a phase-invariant regularizer. The theoretical claim links spectral drift to forgetting only under an explicitly stated spectral sufficiency assumption that is posited rather than derived from the method itself or from self-citations. No equations, fitted parameters, or predictions reduce to the inputs by construction, and no load-bearing self-citation chain is present in the provided text. The experimental results on external benchmarks are therefore not forced by the method's definition.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Spectral sufficiency assumption: skill-conditioned spectral drift controls forgetting.