SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Da-Wei Zhou; De-Chuan Zhan; Han-Jia Ye; Jun-Tao Tang; Yu-Cheng Shi; Zhen-Hao Xie

arxiv: 2602.01990 · v2 · pith:3NS74TEBnew · submitted 2026-02-02 · 💻 cs.LG · cs.AI

Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

keywords: expert, drift, experts, same, instruction, mcit, multimodal, routing

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after the model learns OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failures reflect two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. We also introduce a new benchmark to evaluate MCIT with long task sequences, and extensive experiments demonstrate SAME's state-of-the-art performance. Code is available at https://github.com/LAMDA-CL/Prism.
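
The two stabilization ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a linear router and linear experts, swaps in generic gradient projection for the "orthogonal subspaces" step, and reads "curvature-aware scaling" as preconditioning by the inverse of a damped input covariance. All function names, the `energy` threshold, and the `damping` value are hypothetical choices made for illustration.

```python
import numpy as np


def past_subspace_projector(past_inputs, energy=0.99):
    """Projector onto directions that past router inputs barely used.

    past_inputs: (n_samples, d) array of router inputs from earlier tasks.
    energy: assumed fraction of spectral energy that defines the "past" subspace.
    """
    _, s, vt = np.linalg.svd(past_inputs, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    basis = vt[:k].T  # (d, k) dominant directions of past inputs
    return np.eye(past_inputs.shape[1]) - basis @ basis.T


def project_router_grad(grad, projector):
    """Remove the part of a router gradient (num_experts, d) that would
    change routing logits for inputs inside the past subspace."""
    return grad @ projector


def covariance_scaled_update(grad, cov, damping=1e-3):
    """Shrink an expert update along directions with large historical
    input variance: returns grad @ inv(cov + damping * I)."""
    d = cov.shape[0]
    precond = cov + damping * np.eye(d)  # symmetric positive definite
    return np.linalg.solve(precond, grad.T).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical past inputs that occupy only the first two of four dims.
    past = np.hstack([rng.normal(size=(50, 2)), np.zeros((50, 2))])
    grad = rng.normal(size=(3, 4))
    safe_grad = project_router_grad(grad, past_subspace_projector(past))
    # Logit change on past inputs is (numerically) zero after projection.
    print(np.abs(safe_grad @ past.T).max())
```

Under these assumptions, an update along `safe_grad` leaves routing logits for earlier-task inputs unchanged, and only the running covariance matrix needs to be stored, so no past samples are replayed.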

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
cs.CV 2026-05 unverdicted novelty 7.0

DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning
cs.CV 2026-06 unverdicted novelty 6.0

ProtoAda uses format-aware prototypes for better task routing and geometry-aware consolidation to reduce interference in multimodal continual instruction tuning.

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning
cs.CV 2026-05 unverdicted novelty 6.0

AREA stabilizes attribute extraction with principal geodesic analysis on hyperspherical space and aggregation with lightweight task experts plus variational bottleneck and optimal transport routing, outperforming SOTA...

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning
cs.CL 2026-06 unverdicted novelty 5.0

CRAM uses adaptive MoE with centroid routing and orthogonality constraints to enable parameter-efficient multimodal continual instruction tuning while mitigating forgetting.