Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Mingrui Liu; Shentong Mo; Shijian Deng; Weiguo Pian; Yapeng Tian; Yunhui Guo

arXiv: 2412.13050 · v2 · submitted 2024-12-17 · cs.LG · cs.AI · cs.CL · cs.CV · cs.SD · eess.AS

Weiguo Pian, Shijian Deng, Shentong Mo, Mingrui Liu, Yunhui Guo, Yapeng Tian

Pith reviewed 2026-05-23 06:44 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL · cs.CV · cs.SD · eess.AS

keywords continual learning · multimodal large language models · catastrophic forgetting · modality inconsistency · knowledge distillation · pseudo targets · task-type shift


The pith

MoInCL counters catastrophic forgetting in MLLMs when both input modalities and task types shift across training stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new continual learning problem, MICL, in which multimodal large language models must handle sequences of tasks whose input types (image, audio, video) and output types (captioning versus question-answering) both change. Existing methods fail because modality switches and task-type switches each trigger forgetting of earlier capabilities. MoInCL therefore adds a Pseudo Targets Generation Module that creates synthetic targets for previously seen modalities under new task formats, and an Instruction-based Knowledge Distillation step that keeps the model responsive to old modalities when new ones arrive. Experiments on a six-task benchmark demonstrate that these two additions together produce higher average performance than standard and state-of-the-art continual-learning baselines.

Core claim

MICL is the continual-learning setting that jointly introduces modality inconsistency and task-type inconsistency; MoInCL mitigates the resulting forgetting by generating pseudo targets for old modalities under new task instructions and by distilling knowledge from old-modality instructions, yielding measurable gains on the six-task MICL benchmark over representative continual-learning methods.

What carries the argument

The Pseudo Targets Generation Module together with Instruction-based Knowledge Distillation; the first creates surrogate supervision signals that let the model rehearse earlier modalities under later task formats, while the second transfers modality-specific knowledge via instruction alignment.
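To make the two ingredients concrete, here is a minimal sketch of how a distillation term and a pseudo-target term could be combined with the new-task loss. It is not the paper's objective: the abstract gives no loss formulas, so the temperature-softened KL term is the generic knowledge-distillation form, and the names `kd_loss`, `combined_loss`, `lam_pseudo`, and `lam_kd` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under the logits."""
    return -math.log(softmax(logits)[target_index])


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: generic distillation, with the frozen previous-stage model
    as teacher and the model being trained as student.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(new_task_ce, pseudo_target_ce, distill,
                  lam_pseudo=1.0, lam_kd=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return new_task_ce + lam_pseudo * pseudo_target_ce + lam_kd * distill


if __name__ == "__main__":
    old_model_logits = [2.0, 0.5, -1.0]   # made-up teacher output
    new_model_logits = [1.5, 0.7, -0.5]   # made-up student output
    distill = kd_loss(old_model_logits, new_model_logits)
    ce_new = cross_entropy(new_model_logits, target_index=0)
    ce_pseudo = cross_entropy(new_model_logits, target_index=1)
    print(combined_loss(ce_new, ce_pseudo, distill, lam_pseudo=0.5, lam_kd=0.5))
```

The distillation term is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets it act as a rehearsal signal without stored raw data.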

If this is right

MLLMs can retain captioning and question-answering performance on earlier modalities even after new modalities and task formats are introduced.
Instruction-based distillation allows preservation of modality-specific behavior without storing raw past data.
The approach separates the handling of task-type forgetting from modality forgetting, allowing modular extension to additional modalities.
Average accuracy across the sequence of tasks rises compared with replay-free or modality-incremental baselines.
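The retention claims above can be checked with two standard continual-learning summaries: the average score after the final stage, and the average forgetting. The sketch below assumes a score matrix `results[i][j]` holding the score on task j after training stage i. The function names and the toy numbers are illustrative only and are not taken from the paper, whose exact metrics are not stated on this page.

```python
def average_final_score(results):
    """Mean score over all tasks after the last training stage."""
    final = results[-1]
    return sum(final) / len(final)


def average_forgetting(results):
    """Mean drop from each earlier task's best score to its final score."""
    n = len(results)
    drops = []
    for j in range(n - 1):  # the last task has had no chance to be forgotten
        best = max(results[i][j] for i in range(j, n - 1))
        drops.append(best - results[-1][j])
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Made-up scores for a three-task sequence; entries above the diagonal
    # (tasks not yet trained) are unused placeholders.
    toy = [
        [0.8, 0.0, 0.0],
        [0.6, 0.7, 0.0],
        [0.5, 0.6, 0.9],
    ]
    print(average_final_score(toy))
    print(average_forgetting(toy))
```

Captioning and question-answering scores are usually reported on different scales, so averaging them in one matrix only makes sense after they have been put on a comparable scale.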

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pseudo-target and distillation pattern could be tested on sequences that also include text-only or sensor-data tasks.
If the pseudo-target generator itself is made task-adaptive, the method might scale to longer task sequences without additional hyper-parameters.
The framework suggests that modality-specific instruction tuning can serve as a lightweight rehearsal mechanism for any multimodal model that must evolve over time.

Load-bearing premise

The six-task benchmark is representative of the full range of modality and task-type shifts that will occur in practice, and the added modules do not degrade performance on newly introduced modalities or tasks.

What would settle it

A follow-up experiment on a different collection of modality-task pairs in which MoInCL shows no improvement over baselines or produces lower accuracy on the newest tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2412.13050 by Mingrui Liu, Shentong Mo, Shijian Deng, Weiguo Pian, Yapeng Tian, Yunhui Guo.

**Figure 2.** Overview of our proposed MoInCL, which mainly consists of a Multimodal Large Language Model …

**Figure 3.** Qualitative results of the Fine-tuning method in Order 2. The sample is randomly selected from the test …

**Figure 4.** Qualitative results of the LwF (Li and Hoiem, 2017) method in Order 2. The sample is randomly selected from the test set of Task 1 (Image Captioning). The results are generated using models trained after (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, (e) Task 5, and (f) Task 6.

**Figure 5.** Qualitative results of the EWC (Kirkpatrick et al., 2017) method in Order 2. The sample is randomly selected from the test set of Task 1 (Image Captioning). The results are generated using models trained after (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, (e) Task 5, and (f) Task 6.

**Figure 6.** Qualitative results of the EWF (Xiao et al., 2023) method in Order 2. The sample is randomly selected from the test set of Task 1 (Image Captioning). The results are generated using models trained after (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, (e) Task 5, and (f) Task 6.

**Figure 7.** Qualitative results of the PathWeave (Yu et al., 2024) method in Order 2. The sample is randomly selected from the test set of Task 1 (Image Captioning). The results are generated using models trained after (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, (e) Task 5, and (f) Task 6.

**Figure 8.** Qualitative results of our proposed MoInCL in Order 2. The sample is randomly selected from the test set …

Original abstract

In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.
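The setting in the abstract can be pictured as a sequence of (modality, task type) pairs in which either component may be new at a given stage. The sketch below is one simple reading of that description, not the paper's benchmark code: the `Task` class, the `describe_shifts` helper, and the example order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    modality: str   # "image", "audio", or "video"
    task_type: str  # "captioning" or "qa"


def describe_shifts(sequence):
    """For each stage, report whether its modality or task type is new."""
    seen_modalities, seen_types = set(), set()
    report = []
    for step, task in enumerate(sequence, start=1):
        report.append({
            "step": step,
            "task": task,
            "new_modality": task.modality not in seen_modalities,
            "new_task_type": task.task_type not in seen_types,
            # Earlier modalities that are absent from the current stage
            # and are therefore at risk of being forgotten.
            "old_modalities": sorted(seen_modalities - {task.modality}),
        })
        seen_modalities.add(task.modality)
        seen_types.add(task.task_type)
    return report


if __name__ == "__main__":
    # Hypothetical order, chosen only for illustration.
    order = [
        Task("image", "captioning"),
        Task("audio", "qa"),
        Task("image", "qa"),
    ]
    for row in describe_shifts(order):
        print(row)
```

In this toy order, the second stage changes both modality and task type at once, which is the combined shift that distinguishes MICL from modality-incremental settings.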

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper defines a new MICL scenario mixing modality and task-type shifts for MLLMs and proposes two targeted modules, but the abstract supplies zero numbers or ablations so the gains cannot be assessed.


The punchline is that the paper introduces Modality-Inconsistent Continual Learning as a distinct setting for MLLMs, where both modality and task type change, and offers two modules to reduce forgetting in that case. The combination of those shifts does not appear in the settings they cite, so the scenario itself is the clearest addition.

The work does a good job spelling out why task-type shifts on old modalities and modality shifts on old tasks both matter, and why standard continual learning tricks may not cover both at once. The Pseudo Targets Generation Module and the Instruction-based Knowledge Distillation are presented as direct responses to each problem. Benchmarking on six tasks that mix image, audio, and video with captioning and QA is a concrete way to test the idea.

The soft spot is the complete absence of any numbers in the abstract. It says the method shows significant improvements, but there are no tables, no baseline scores, no ablation on the two modules, and no check on whether new-task performance drops. Without those it is hard to know if the modules work as claimed or if the benchmark is simply not demanding enough. The full paper may have the details, but they are not visible here.

This paper is for researchers focused on continual learning in multimodal settings. A reader who wants to extend forgetting mitigation to mixed-modality MLLMs would find the scenario and the two modules worth looking at. It is not broad enough to interest people outside that niche.

I think it deserves peer review. The setting is new enough and the approach specific enough that referees can evaluate whether the experiments support the claims. The authors will need to add the missing quantitative evidence and any ablations in a revision.

Referee Report

1 major / 0 minor

Summary. The paper introduces Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models involving simultaneous shifts in input modalities (image/audio/video) and task types (captioning/question-answering). It proposes MoInCL, which uses a Pseudo Targets Generation Module to address forgetting from task-type changes on seen modalities and Instruction-based Knowledge Distillation to maintain performance on prior modalities when new ones are added. The method is evaluated on a six-task benchmark and asserted to outperform representative and state-of-the-art continual learning baselines.

Significance. If the empirical claims hold after detailed validation, the work would be significant for defining and tackling a realistic MICL setting that combines modality and task-type shifts, both known drivers of forgetting in MLLMs. The two proposed modules target these shifts specifically, and the six-task benchmark could serve as a useful testbed if it adequately exercises the inconsistencies without hidden trade-offs on new modalities.

major comments (1)

[Abstract] The central claim that 'the experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines' is unsupported by any metrics, baseline names, task descriptions, ablation results, or statistical tests. This is load-bearing for the paper's main contribution, as the soundness of the two modules cannot be assessed without quantitative evidence.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for quantitative support in the abstract. We agree this strengthens the presentation and will revise accordingly while noting that the full experimental details appear in the body of the manuscript.


Referee: [Abstract] The central claim that 'the experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines' is unsupported by any metrics, baseline names, task descriptions, ablation results, or statistical tests. This is load-bearing for the paper's main contribution, as the soundness of the two modules cannot be assessed without quantitative evidence.

Authors: We agree that the abstract would benefit from including specific quantitative evidence to support the superiority claim. The manuscript body (Sections 4 and 5) already provides the six-task benchmark details, baseline names (e.g., standard CL methods and SOTA variants), metrics, and ablation studies, but these are not referenced in the abstract. We will revise the abstract to incorporate key results, such as average accuracy gains and task-specific improvements, to make the central claim self-contained and substantiated.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper introduces a new MICL scenario and two modules (Pseudo Targets Generation Module, Instruction-based Knowledge Distillation) whose effectiveness is asserted via six-task experiments on MLLMs. No equations, algorithmic derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claims rest on empirical comparisons to baselines rather than any reduction to inputs by construction, self-definition, or load-bearing self-citation chains. This is a standard empirical engineering contribution whose validation is external to the method definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the empirical effectiveness of the two introduced modules and the representativeness of the six-task benchmark.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
cs.CV 2026-04 unverdicted novelty 7.0

Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. 2021. SS-IL: Separated softmax for incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 844–853.

[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.

[3] Xusheng Cao, Haori Lu, Linlan Huang, Xialei Liu, and Ming-Ming Cheng. 2024. Generative multi-modal models are good class incremental learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28706–28717.

[4] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486.

[5] Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. 2023. BEATs: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, volume 202, pages 5178–5193. PMLR.

[6] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems.

[7] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. 2020. PODNet: Pooled outputs distillation for small-tasks incremental learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX, pages 86–102. Springer.

[8] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.

[9] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[10] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369.

[11] Jinghan He, Haiyun Guo, Ming Tang, and Jinqiao Wang. 2023. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206.

[12] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

[13] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

[14] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132.

[15] Sanghwan Kim, Lorenzo Noci, Antonio Orvieto, and Thomas Hofmann. 2023. Achieving a better stability-plasticity trade-off via auxiliary networks in continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11930–11939.

[16] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

[17] Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven CH Hoi. 2022a. LAVIS: A library for language-vision intelligence. arXiv preprint arXiv:2209.09019.

[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR.

[19] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022b. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.

[20] Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.

[21] Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. 2022. Clotho-AQA: A crowdsourced dataset for audio question answering. In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1140–1144. IEEE.

[22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306.

[23] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in Neural Information Processing Systems, 36.

[24] David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30.

[25] Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

[26] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204.

[27] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022, pages 3470–3487. Association for Computational Linguistics (ACL).

[28] Shentong Mo, Weiguo Pian, and Yapeng Tian. 2023. Class-incremental grouping network for continual audio-visual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7788–7798.

[29] Augustus Odena, Christopher Olah, and Jonathon Shlens. 2017. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, pages 2642–2651. PMLR.

[30] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jahnichen, and Moin Nabi. 2019. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11321–11329.

[31] Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. 2023. X-InstructBLIP: A framework for aligning X-modal instruction-aware representations to LLMs and emergent cross-modal reasoning. arXiv preprint arXiv:2311.18799.

[32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

[33] Weiguo Pian, Shentong Mo, Yunhui Guo, and Yapeng Tian. 2023. Audio-visual class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7799–7811.

[34] Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, and Yapeng Tian. 2024. Continual audio-visual sound separation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

[36] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010.

[37] Zechao Sun, Haolin Jin, Weitong Chen, and Luping Zhou. 2024. AWF: Adaptive weight fusion for enhanced class incremental semantic segmentation. arXiv preprint arXiv:2409.08516.

[38] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

[39] Jia-Wen Xiao, Chang-Bin Zhang, Jiekang Feng, Xialei Liu, Joost van de Weijer, and Ming-Ming Cheng. 2023. Endpoints weight fusion for class incremental semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7204–7213.

[40] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1645–1653.

[41] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296.

[42] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78.

[43] Jiazuo Yu, Haomiao Xiong, Lu Zhang, Haiwen Diao, Yunzhi Zhuge, Lanqing Hong, Dong Wang, Huchuan Lu, You He, and Long Chen. 2024. LLMs can evolve continually on modality for X-modal reasoning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[44] Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, and Cheng-Lin Liu. 2024. ModalPrompt: Dual-modality guided prompt for continual learning of large multimodal models. arXiv preprint arXiv:2410.05849.

[45] Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.

[46] Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, and Huawen Feng. 2024. Beyond anti-forgetting: Multimodal continual instruction tuning with positive forward transfer. arXiv preprint arXiv:2401.09181.

[1] [1]

Hongjoon Ahn, Jihwan Kwak, Subin Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. 2021. Ss-il: Separated softmax for incremental learning. In Proceedings of the IEEE/CVF International conference on computer vision, pages 844--853

work page 2021

[2] [2]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716--23736

work page 2022

[3] [3]

Xusheng Cao, Haori Lu, Linlan Huang, Xialei Liu, and Ming-Ming Cheng. 2024. Generative multi-modal models are good class incremental learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28706--28717

work page 2024

[4] [4]

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486

work page internal anchor Pith review Pith/arXiv arXiv 2019

[5] [5]

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. 2023. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, volume 202, pages 5178--5193. PMLR

work page 2023

[6] [6]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. Instruct BLIP : Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems

work page 2023

[7] [7]

Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. 2020. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part XX 16, pages 86--102. Springer

work page 2020

[8] [8]

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358--19369

work page 2023

[11] [11]

Jinghan He, Haiyun Guo, Ming Tang, and Jinqiao Wang. 2023. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206

work page arXiv 2023

[12] [12]

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

work page internal anchor Pith review Pith/arXiv arXiv 2016

[13] [13]

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations

work page 2022

[14] [14]

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119--132

work page 2019

[15] [15]

Sanghwan Kim, Lorenzo Noci, Antonio Orvieto, and Thomas Hofmann. 2023. Achieving a better stability-plasticity trade-off via auxiliary networks in continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11930--11939

work page 2023

[16] [16]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521--3526

work page 2017

[17] [17]

Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven CH Hoi. 2022 a . Lavis: A library for language-vision intelligence. arXiv preprint arXiv:2209.09019

[18]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730--19742. PMLR

[19]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022b. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888--12900. PMLR

[20]

Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935--2947

[21]

Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. 2022. Clotho-AQA: A crowdsourced dataset for audio question answering. In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1140--1144. IEEE

[22]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296--26306

[23]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in Neural Information Processing Systems, 36

[24]

David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30

[25]

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations

[26]

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195--3204

[27]

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022, pages 3470--3487. Association for Computational Linguistics (ACL)

[28]

Shentong Mo, Weiguo Pian, and Yapeng Tian. 2023. Class-incremental grouping network for continual audio-visual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7788--7798

[29]

Augustus Odena, Christopher Olah, and Jonathon Shlens. 2017. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, pages 2642--2651. PMLR

[30]

Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jahnichen, and Moin Nabi. 2019. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11321--11329

[31]

Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. 2023. X-InstructBLIP: A framework for aligning x-modal instruction-aware representations to LLMs and emergent cross-modal reasoning. arXiv preprint arXiv:2311.18799

[32]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32

[33]

Weiguo Pian, Shentong Mo, Yunhui Guo, and Yapeng Tian. 2023. Audio-visual class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7799--7811

[34]

Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, and Yapeng Tian. 2024. Continual audio-visual sound separation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

[35]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748--8763. PMLR

[36]

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001--2010

[37]

Zechao Sun, Haolin Jin, Weitong Chen, and Luping Zhou. 2024. AWF: Adaptive weight fusion for enhanced class incremental semantic segmentation. arXiv preprint arXiv:2409.08516

[38]

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566--4575

[39]

Jia-Wen Xiao, Chang-Bin Zhang, Jiekang Feng, Xialei Liu, Joost van de Weijer, and Ming-Ming Cheng. 2023. Endpoints weight fusion for class incremental semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7204--7213

[40]

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1645--1653

[41]

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288--5296

[42]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67--78

[43]

Jiazuo Yu, Haomiao Xiong, Lu Zhang, Haiwen Diao, Yunzhi Zhuge, Lanqing Hong, Dong Wang, Huchuan Lu, You He, and Long Chen. 2024. LLMs can evolve continually on modality for x-modal reasoning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

[44]

Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, and Cheng-Lin Liu. 2024. ModalPrompt: Dual-modality guided prompt for continual learning of large multimodal models. arXiv preprint arXiv:2410.05849

[45]

Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858

[46]

Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, and Huawen Feng. 2024. Beyond anti-forgetting: Multimodal continual instruction tuning with positive forward transfer. arXiv preprint arXiv:2401.09181
