Investigating the Catastrophic Forgetting in Multimodal Large Language Models

Mu Cai; Qing Qu; Shengbang Tong; Xiao Li; Yi Ma; Yong Jae Lee; Yuexiang Zhai

arxiv: 2309.10313 · v4 · pith:BM2JDM3Fnew · submitted 2023-09-19 · 💻 cs.CL · cs.AI · cs.LG

keywords: fine-tuning, image, mllm, mllms, performance, catastrophic, forgetting, llms

Following the success of GPT4, there has been a surge of interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs by fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon in which the fine-tuned model fails to retain performance comparable to that of the pre-trained model, remains an inherent problem in MLLMs. In this paper, we introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs by treating each MLLM as an image classifier. We first apply EMT to several open-source fine-tuned MLLMs and find that almost all of them fail to retain the performance levels of their vision encoders on standard image classification tasks. We then continue fine-tuning LLaVA, an MLLM, and use EMT to assess its performance throughout fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on one image dataset improves performance on other image datasets by enhancing the alignment of text and visual features. As fine-tuning proceeds, however, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks, and that the current MLLM fine-tuning procedure still has room for improvement.
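The idea of treating an MLLM as an image classifier can be illustrated with a minimal sketch. The abstract does not specify how EMT prompts a model or scores its answers, so everything concrete here is an assumption made for illustration: the prompt wording, the substring-based matching rule, and the hypothetical `generate(image, prompt)` callable that stands in for any MLLM.

```python
from typing import Any, Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> str:
    """Ask the model to pick one label from a closed set (assumed wording)."""
    options = ", ".join(class_names)
    return f"What is the object in the image? Answer with one of: {options}."


def match_label(answer: str, class_names: Sequence[str]) -> Optional[str]:
    """Return the class name mentioned in the free-text answer.

    Assumed rule: the answer counts only if exactly one class name appears
    in it. Naive substring matching would confuse labels such as "cat" and
    "caterpillar"; a real evaluation would need something more robust.
    """
    text = answer.lower()
    hits = [name for name in class_names if name.lower() in text]
    return hits[0] if len(hits) == 1 else None


def classification_accuracy(
    generate: Callable[[Any, str], str],  # hypothetical MLLM wrapper
    samples: Iterable[Tuple[Any, str]],   # (image, true_label) pairs
    class_names: Sequence[str],
) -> float:
    """Score the model's text answers as if it were an image classifier."""
    prompt = build_prompt(class_names)
    correct = 0
    total = 0
    for image, true_label in samples:
        predicted = match_label(generate(image, prompt), class_names)
        correct += int(predicted == true_label)
        total += 1
    return correct / total if total else 0.0
```

Under these assumptions, running `classification_accuracy` on the same labeled samples before and after fine-tuning, and comparing both numbers against the vision encoder's own accuracy, would give a rough view of the kind of performance gap the abstract describes.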

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Time Distillation for Continual Model Adaptation
cs.CV · 2025-06 · conditional · novelty 7.0
CoDiRe blends VLM and target model predictions via MSP-based weighting and Optimal Transport rectification to enable stable continual test-time adaptation, outperforming CoTTA by 10.55% on ImageNet-C at 48% of the com...

Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning
cs.HC · 2025-04 · unverdicted · novelty 7.0
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
cs.CV · 2023-10 · unverdicted · novelty 7.0
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL · 2026-04 · unverdicted · novelty 6.0
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
cs.CL · 2026-04 · unverdicted · novelty 6.0
PRISM benchmark disentangles LLM hallucinations into knowledge missing, knowledge errors, reasoning errors, and instruction-following errors across three generation stages, revealing trade-offs when testing 24 models.

Routing-Based Continual Learning for Multimodal Large Language Models
cs.LG · 2025-11 · unverdicted · novelty 6.0
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
cs.CL · 2026-06 · unverdicted · novelty 5.0
The survey formalizes MLLM perception as a unified vision-language capability and traces its evolution via a new five-stage taxonomy while outlining future challenges.

ActiveScope: Actively Seeking and Correcting Perception for MLLMs
cs.CV · 2026-06 · unverdicted · novelty 5.0
ActiveScope introduces Semantic Anchor Localization (SAL) and Interference-Suppressed Refinement (ISR) to address semantic bias and contextual dominance in MLLMs, reporting 96.34% accuracy on V* Bench.