How to Teach Large Multimodal Models New Skills

Derek Hoiem; Yao Xiao; Yaoyao Liu; Yiming Gong; Zhen Zhu

arXiv: 2510.08564 · v2 · submitted 2025-10-09 · 💻 cs.AI · cs.CV · cs.LG


Pith reviewed 2026-05-18 08:39 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords multimodal models · sequential fine-tuning · catastrophic forgetting · output distribution shift · selective tuning · self-attention layers · MLP projections


The pith

Selective updates to self-attention or MLP layers let multimodal models learn new skills while largely preserving prior abilities by limiting output distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines sequential fine-tuning of large multimodal models on target skills while tracking retention on held-out benchmarks. It finds that lost performance on general tasks can partly recover when the model is later tuned on a different skill, and links this pattern to measurable changes in the output token distribution that a simple counting-bias probe can track. Guided by this observation, the authors test component-wise tuning and show that restricting updates to self-attention projection layers or to the Gate and Up projections in the MLP (while freezing the Down projection) yields strong gains on new skills with far less forgetting than updating the entire model. These selective recipes perform at least as well as established mitigation techniques such as LoRA or Learning without Forgetting, yet require no replay buffers, extra parameters, or staged auxiliary losses. The pattern holds across three different model families.

Core claim

Performance lost on held-out tasks after fine-tuning on one skill can recover when the model is subsequently tuned on a different skill. This recovery co-varies with shifts in the output token distribution. Updating only the self-attention projection layers produces a learning-forgetting delta of +24.9 / -0.6, while updating only the MLP Gate&Up while freezing Down produces +30.5 / -2.1; both substantially outperform full-LLM tuning (+31.8 / -23.3). The same selective rules match or exceed common forgetting-mitigation baselines without replay or auxiliary parameters.
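
To make the trade-off concrete, the three reported pairs can be reduced to one number each. The sketch below uses only the deltas quoted above; the "forgetting per point learned" ratio is an illustrative summary introduced here, not a metric from the paper.

```python
# Reported (learning delta, held-out delta) pairs quoted in the text above.
REPORTED = {
    "SA Proj.": (24.9, -0.6),
    "MLP Gate&Up": (30.5, -2.1),
    "Full LLM": (31.8, -23.3),
}


def forgetting_per_point_learned(learning: float, held_out: float) -> float:
    """Held-out points lost per point gained on the target skill.

    Illustrative ratio only; the paper reports the two deltas separately.
    """
    return -held_out / learning


ratios = {
    name: forgetting_per_point_learned(*pair) for name, pair in REPORTED.items()
}
for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:.3f}")
```

Under this ratio, full-LLM tuning gives up roughly 0.73 held-out points per point learned, against about 0.02 for SA Proj. and 0.07 for MLP Gate&Up.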

What carries the argument

Selective component-wise tuning that restricts updates to self-attention projections or to the Gate&Up sub-layers of the MLP (freezing the Down projection) in order to limit output token distribution shift.
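
A minimal sketch of how such a recipe might be expressed, assuming parameters are named in the Hugging Face LLaMA/Qwen2 style (`q_proj`, `gate_proj`, and so on). The recipe names, the `scope` argument, and the choice to include `o_proj` in the self-attention recipe are assumptions made for illustration; the authors' own code may differ.

```python
from types import SimpleNamespace

# Assumed sub-module names, following Hugging Face LLaMA/Qwen2-style decoders.
# Whether the paper's "SA Proj." recipe includes o_proj is an assumption here.
RECIPES = {
    "sa_proj": {"q_proj", "k_proj", "v_proj", "o_proj"},
    "mlp_gate_up": {"gate_proj", "up_proj"},  # down_proj stays frozen
}


def is_trainable(param_name, recipe, scope="language_model"):
    """True if the parameter lies inside `scope` and belongs to the recipe.

    `scope` is a hypothetical name fragment used to skip the vision encoder.
    """
    parts = param_name.split(".")
    return scope in parts and any(p in RECIPES[recipe] for p in parts)


def apply_recipe(named_parameters, recipe, scope="language_model"):
    """Set requires_grad on every (name, parameter) pair; return trainable names."""
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name, recipe, scope)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without a deep-learning framework.
demo = [
    ("language_model.layers.0.self_attn.q_proj.weight", SimpleNamespace(requires_grad=True)),
    ("language_model.layers.0.mlp.gate_proj.weight", SimpleNamespace(requires_grad=True)),
    ("language_model.layers.0.mlp.down_proj.weight", SimpleNamespace(requires_grad=True)),
    ("vision_tower.layers.0.self_attn.q_proj.weight", SimpleNamespace(requires_grad=True)),
]
print(apply_recipe(demo, "mlp_gate_up"))
```

With PyTorch, the iterator returned by `model.named_parameters()` could be passed in place of `demo`, provided the model's parameter names match the assumed patterns.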

If this is right

The selective recipes achieve comparable or better learning-stability balance than LwF, LoRA, Mixture-of-Experts, or weight-space interpolation while remaining simpler.
Recovery of previously lost performance occurs when the model is tuned on a subsequent but different skill.
The same selective tuning rules generalize across LLaVA-OneVision, LLaVA-NeXT, and Qwen2.5-VL.
A counting-bias probe on output tokens can serve as a practical monitor for distribution shift during tuning.
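
The probe's exact construction is not described on this page, so the following is only one plausible form: the share of responses that are a bare number when the prompts did not ask for a count. The function names and the number-word list are hypothetical.

```python
import re

# Hypothetical list of number words treated as "counting-style" answers.
_NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
}


def is_numeric_answer(text: str) -> bool:
    """True if the whole response is a bare number (digits or a number word)."""
    token = text.strip().lower().rstrip(".")
    return bool(re.fullmatch(r"\d+", token)) or token in _NUMBER_WORDS


def counting_bias(responses) -> float:
    """Fraction of responses that are bare numbers.

    Intended for prompts that do NOT ask for a count, so a rising value
    suggests the output distribution is drifting toward number tokens.
    """
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_numeric_answer(r) for r in responses) / len(responses)
```

Tracking this value across checkpoints, alongside held-out scores, would be one way to use the probe as a monitor.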

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Because recovery happens across different skills, the order in which skills are introduced may matter less under selective tuning than under full tuning.
The approach could be tested on sequential instruction-tuning pipelines where new visual or language tasks arrive over time.
If output distribution shift is the dominant driver of forgetting, then similar selective updates might reduce forgetting in other sequential adaptation settings such as continual pre-training.

Load-bearing premise

Changes measured on the eight held-out benchmarks are caused primarily by output distribution shift rather than by other unmeasured factors during sequential tuning.

What would settle it

A new set of held-out benchmarks on which selective tuning produces the same high forgetting rates as full tuning, or a direct measurement showing that the counting-bias probe no longer correlates with forgetting after selective updates.
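
The second test could be run as a plain correlation check across checkpoints. The sketch below is a standard Pearson coefficient written with the standard library only; pairing one probe value with one held-out drop per checkpoint is an assumption about how the measurement would be set up.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation between two equal-length sequences.

    Here xs would hold a probe value per checkpoint and ys the matching
    held-out performance drop (both hypothetical inputs).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A coefficient near zero on checkpoints from selectively tuned models would count against the probe as a monitor; a value near one would support it.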

Original abstract

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. Surprisingly, we find that performance lost on held-out tasks after fine-tuning on one skill can partly recover when the model is subsequently tuned on a different skill. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that shows the shift co-varies with forgetting. Guided by this insight, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers (SA Proj., $\Delta$ learning +24.9 / $\Delta$ held-out forgetting -0.6), and (ii) updating only the MLP Gate&Up while freezing the Down projection (+30.5 / -2.1). Both substantially outperform full-LLM tuning (+31.8 / -23.3) in the learning-forgetting trade-off. We also compare against common forgetting mitigation methods: Learning without Forgetting (LwF), LoRA, Mixture-of-Experts, and weight-space interpolation (WiSE-FT), and find that our selective tuning recipes match or exceed their learning-stability balance while remaining simpler, requiring no replay, auxiliary parameters, or per-stage tuning. These results hold across LLaVA-OneVision, LLaVA-NeXT, and Qwen2.5-VL, confirming that the key to teaching LMMs new skills without forgetting lies in controlling output distribution shift by choosing which components to tune. Code will be made available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Selective updates to self-attention projections or MLP Gate&Up layers give a clearer learning-forgetting trade-off than full tuning across three LMM families, with some held-out recovery appearing during sequential tasks.


The main result is that selective component updates (self-attention projections, or MLP Gate&Up with Down frozen) deliver strong new-skill gains with minimal held-out forgetting, beating full-LLM tuning and matching or exceeding LwF, LoRA, MoE, and WiSE-FT without replay or extra modules. The authors also report that some performance lost after tuning on one skill partly recovers after tuning on the next, and they tie this to output distribution shift measured with a counting-bias probe.

Referee Report

2 major / 2 minor

Summary. The paper studies sequential fine-tuning of large multimodal models on five target skills while monitoring performance on eight held-out benchmarks across three model families (LLaVA-OneVision, LLaVA-NeXT, Qwen2.5-VL). It reports that held-out performance can partially recover after tuning on subsequent skills, links this to measurable shifts in output token distributions via a counting-bias probe, and identifies two selective tuning recipes—updating only self-attention projection layers or only MLP Gate&Up while freezing Down—that yield superior learning-forgetting trade-offs (+24.9/-0.6 and +30.5/-2.1) compared to full-LLM tuning (+31.8/-23.3) and baselines including LwF, LoRA, MoE, and WiSE-FT.

Significance. If the empirical trade-off results hold, the work provides simple, practical tuning strategies for continual skill acquisition in LMMs that require no replay buffers, auxiliary parameters, or per-stage optimization, while matching or exceeding established forgetting-mitigation methods. The cross-model consistency and direct comparison to multiple baselines, together with the planned code release, strengthen the potential impact for reproducible research in multimodal continual learning.

major comments (2)

[Experimental setup and results sections] The eight held-out benchmarks are central to the forgetting measurements and the claim that selective tuning controls output distribution shift, yet the manuscript provides limited detail on their exact definitions, task formulations, and any selection criteria used to choose the five target skills versus the held-out set; this leaves open the possibility of selection effects that could affect the generality of the reported deltas.
[Results section, Table reporting deltas] The learning and forgetting deltas (e.g., SA Proj. +24.9 / -0.6) are presented without reported standard deviations, number of runs, or statistical significance tests, which is load-bearing for confidently asserting that the selective recipes substantially outperform full tuning in the trade-off.

minor comments (2)

[Methods] The counting-bias probe is introduced to link distribution shift to forgetting, but its exact construction (e.g., prompt templates, counting categories, and how correlation with held-out scores is quantified) could be clarified in the methods for easier replication.
[Introduction and methods] Notation for layer components (Gate&Up, Down projection) should be defined explicitly on first use with reference to the underlying transformer architecture to avoid ambiguity for readers unfamiliar with the specific LMM implementations.
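
For readers who want the notation now, the gated MLP block used in LLaMA- and Qwen2-style decoders is commonly written as below. That the three model families studied here use exactly this form, with SiLU as the activation, is an assumption based on their public backbones rather than something stated on this page.

```latex
\mathrm{MLP}(x) = W_{\mathrm{down}}\Bigl(\mathrm{SiLU}\bigl(W_{\mathrm{gate}}\,x\bigr) \odot \bigl(W_{\mathrm{up}}\,x\bigr)\Bigr)
```

In this notation, "Gate&Up" refers to $W_{\mathrm{gate}}$ and $W_{\mathrm{up}}$, and the "Down projection" is $W_{\mathrm{down}}$, the matrix that maps the gated hidden activation back to the model dimension.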

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The comments identify opportunities to improve clarity on experimental details and statistical reporting. We address each major comment below and outline the corresponding revisions.


Referee: [Experimental setup and results sections] The eight held-out benchmarks are central to the forgetting measurements and the claim that selective tuning controls output distribution shift, yet the manuscript provides limited detail on their exact definitions, task formulations, and any selection criteria used to choose the five target skills versus the held-out set; this leaves open the possibility of selection effects that could affect the generality of the reported deltas.

Authors: We agree that expanded details on the benchmarks and selection criteria will strengthen the presentation and address concerns about potential selection effects. In the revised manuscript we will add precise task formulations and definitions for all eight held-out benchmarks in the Experimental Setup section. We will also describe the selection criteria for the five target skills, noting that they were chosen to span distinct capabilities (e.g., visual reasoning, detailed captioning, and compositional VQA) while the held-out set covers a broader range of general multimodal abilities. These additions will better substantiate the generality of the reported deltas.
Revision: yes

Referee: [Results section, Table reporting deltas] The learning and forgetting deltas (e.g., SA Proj. +24.9 / -0.6) are presented without reported standard deviations, number of runs, or statistical significance tests, which is load-bearing for confidently asserting that the selective recipes substantially outperform full tuning in the trade-off.

Authors: We acknowledge that explicit reporting of variability and statistical measures would increase confidence in the comparisons. Due to the high computational cost of sequential fine-tuning across three model families, the primary results reflect single runs per configuration. In the revision we will state the number of runs explicitly, report standard deviations from any auxiliary multi-seed checks that were performed, and add a brief discussion of effect sizes and cross-model consistency to support the observed trade-off advantages. Full statistical significance testing across all configurations would require additional compute that is not feasible within the current experimental budget.
Revision: partial
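
The variability reporting discussed in this exchange could be as simple as a mean and sample standard deviation per configuration. The helper below is a generic sketch with a hypothetical name; it assumes one delta per seed and does not reflect any results from the paper.

```python
from statistics import mean, stdev


def summarize_deltas(deltas):
    """Return (mean, sample standard deviation) of per-seed deltas.

    At least two seeds are needed, because the sample standard deviation
    is undefined for a single run.
    """
    values = list(deltas)
    if len(values) < 2:
        raise ValueError("need deltas from at least two seeds")
    return mean(values), stdev(values)
```

Reporting the number of seeds next to each pair keeps single-run configurations clearly distinguishable from multi-seed ones.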

Circularity Check

0 steps flagged

No significant circularity in empirical derivation chain

full rationale

The paper's central claims rest on direct experimental measurements: performance deltas on eight held-out benchmarks and a counting-bias probe across three model families after sequential fine-tuning. Selective updates to self-attention projections or MLP Gate&Up layers are shown to improve the learning-forgetting trade-off via concrete reported values (+24.9/-0.6 and +30.5/-2.1) compared to full tuning (+31.8/-23.3). No equations, fitted parameters, or self-citations are used to define or predict these quantities; the results are obtained from explicit tuning runs and baseline comparisons (LwF, LoRA, MoE, WiSE-FT). The derivation chain is therefore self-contained against external benchmarks with no reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard machine-learning assumptions about what held-out benchmarks measure, and on the premise that output distribution shift is the dominant driver of the observed forgetting.

axioms (1)

Domain assumption: Held-out benchmarks measure general multimodal ability independently of the particular fine-tuning tasks and component choices.
Invoked when interpreting Δ held-out scores as forgetting.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

DocAtlas: Multilingual Document Understanding Across 80+ Languages
cs.CL 2026-05 unverdicted novelty 6.0

DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.