How to Teach Large Multimodal Models New Skills
Pith reviewed 2026-05-18 08:39 UTC · model grok-4.3
The pith
Selective updates to self-attention or MLP layers let multimodal models learn new skills while largely preserving prior abilities by limiting output distribution shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Performance lost on held-out tasks after fine-tuning on one skill can partly recover when the model is subsequently tuned on a different skill. This forgetting and recovery co-varies with shifts in the output token distribution. Updating only the self-attention projection layers yields learning / held-out forgetting deltas of +24.9 / -0.6, and updating only the MLP Gate&Up projections while freezing the Down projection yields +30.5 / -2.1; both offer a substantially better trade-off than full-LLM tuning (+31.8 / -23.3). The same selective rules match or exceed common forgetting-mitigation baselines without replay or auxiliary parameters.
What carries the argument
Selective component-wise tuning that restricts updates to self-attention projections or to the Gate&Up sub-layers of the MLP (freezing the Down projection) in order to limit output token distribution shift.
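The selection rule can be sketched as a filter over parameter names. This is a minimal illustration, not the authors' code: the substrings (`q_proj`, `gate_proj`, and so on) and the `language_model.` prefix assume Hugging Face-style LLaMA/Qwen2 module naming, and treating "self-attention projections" as Q, K, V and O is likewise an assumption.

```python
# Sketch: decide which parameters stay trainable under each selective recipe.
# Module-name substrings and the LLM prefix are assumptions about the
# checkpoint's naming scheme; adjust them to the model you actually load.
RECIPES = {
    "sa_proj": ("q_proj", "k_proj", "v_proj", "o_proj"),  # assumed: Q, K, V, O
    "mlp_gate_up": ("gate_proj", "up_proj"),              # down_proj stays frozen
}


def is_trainable(param_name, recipe, llm_prefix="language_model."):
    """Return True if the named parameter is updated under the recipe."""
    # Only the language model is tuned; vision-tower weights stay frozen.
    if not param_name.startswith(llm_prefix):
        return False
    # Compare whole dotted components so "up_proj" cannot match a longer name.
    parts = param_name.split(".")
    return any(key in parts for key in RECIPES[recipe])


def split_parameters(param_names, recipe, llm_prefix="language_model."):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable, frozen = [], []
    for name in param_names:
        target = trainable if is_trainable(name, recipe, llm_prefix) else frozen
        target.append(name)
    return trainable, frozen


if __name__ == "__main__":
    names = [
        "language_model.model.layers.0.self_attn.q_proj.weight",
        "language_model.model.layers.0.mlp.gate_proj.weight",
        "language_model.model.layers.0.mlp.down_proj.weight",
        "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight",
    ]
    print(split_parameters(names, "mlp_gate_up"))
```

With a PyTorch model, the same predicate would typically be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, recipe)`.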
If this is right
- The selective recipes achieve comparable or better learning-stability balance than LwF, LoRA, Mixture-of-Experts, or weight-space interpolation while remaining simpler.
- Recovery of previously lost performance occurs when the model is tuned on a subsequent but different skill.
- The same selective tuning rules generalize across LLaVA-OneVision, LLaVA-NeXT, and Qwen2.5-VL.
- A counting-bias probe on output tokens can serve as a practical monitor for distribution shift during tuning.
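One way such a monitor might look is sketched below. The paper's probe construction is not given here, so this is only a stand-in: it assumes counting bias can be approximated by the share of digit strings and small number words in generated responses, and the word list and tokenizer are hypothetical choices.

```python
import re

# Assumption: counting bias is approximated by how often generated text
# contains digits or number words. This is not the paper's probe definition.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
}
TOKEN_RE = re.compile(r"[a-z]+|\d+")


def number_token_rate(responses):
    """Fraction of word-level tokens that are digits or number words."""
    total = 0
    numeric = 0
    for text in responses:
        for token in TOKEN_RE.findall(text.lower()):
            total += 1
            if token.isdigit() or token in NUMBER_WORDS:
                numeric += 1
    return numeric / total if total else 0.0


def probe_drift(responses_before, responses_after):
    """Change in number-token rate between two checkpoints on the same prompts."""
    return number_token_rate(responses_after) - number_token_rate(responses_before)


if __name__ == "__main__":
    before = ["a cat sits on a sofa"]
    after = ["two cats and 3 dogs"]
    print(probe_drift(before, after))
```

A crude proxy like this over-counts "one" used as a pronoun, so it is better read as a trend indicator across checkpoints than as an absolute measure.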
Where Pith is reading between the lines
- Because recovery happens across different skills, the order in which skills are introduced may matter less under selective tuning than under full tuning.
- The approach could be tested on sequential instruction-tuning pipelines where new visual or language tasks arrive over time.
- If output distribution shift is the dominant driver of forgetting, then similar selective updates might reduce forgetting in other sequential adaptation settings such as continual pre-training.
Load-bearing premise
Changes measured on the eight held-out benchmarks are caused primarily by output distribution shift rather than by other unmeasured factors during sequential tuning.
What would settle it
A new set of held-out benchmarks on which selective tuning produces the same high forgetting rates as full tuning, or a direct measurement showing that the counting-bias probe no longer correlates with forgetting after selective updates.
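The second test amounts to checking whether a probe score and held-out forgetting still move together across checkpoints. A minimal sketch using a hand-written Pearson correlation follows; the inputs in the example are placeholder values for illustration only, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values only: one entry per checkpoint.
    probe_scores = [0.1, 0.2, 0.3, 0.4]
    heldout_drop = [1.0, 2.0, 3.0, 4.0]
    print(pearson(probe_scores, heldout_drop))
```

A coefficient near zero under selective tuning, alongside substantial forgetting, would count against the distribution-shift explanation.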
Original abstract
How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. Surprisingly, we find that performance lost on held-out tasks after fine-tuning on one skill can partly recover when the model is subsequently tuned on a different skill. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that shows the shift co-varies with forgetting. Guided by this insight, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers (SA Proj., $\Delta$ learning +24.9 / $\Delta$ held-out forgetting -0.6), and (ii) updating only the MLP Gate&Up while freezing the Down projection (+30.5 / -2.1). Both substantially outperform full-LLM tuning (+31.8 / -23.3) in the learning-forgetting trade-off. We also compare against common forgetting mitigation methods: Learning without Forgetting (LwF), LoRA, Mixture-of-Experts, and weight-space interpolation (WiSE-FT), and find that our selective tuning recipes match or exceed their learning-stability balance while remaining simpler, requiring no replay, auxiliary parameters, or per-stage tuning. These results hold across LLaVA-OneVision, LLaVA-NeXT, and Qwen2.5-VL, confirming that the key to teaching LMMs new skills without forgetting lies in controlling output distribution shift by choosing which components to tune. Code will be made available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies sequential fine-tuning of large multimodal models on five target skills while monitoring performance on eight held-out benchmarks across three model families (LLaVA-OneVision, LLaVA-NeXT, Qwen2.5-VL). It reports that held-out performance can partially recover after tuning on subsequent skills, links this to measurable shifts in output token distributions via a counting-bias probe, and identifies two selective tuning recipes—updating only self-attention projection layers or only MLP Gate&Up while freezing Down—that yield superior learning-forgetting trade-offs (+24.9/-0.6 and +30.5/-2.1) compared to full-LLM tuning (+31.8/-23.3) and baselines including LwF, LoRA, MoE, and WiSE-FT.
Significance. If the empirical trade-off results hold, the work provides simple, practical tuning strategies for continual skill acquisition in LMMs that require no replay buffers, auxiliary parameters, or per-stage optimization, while matching or exceeding established forgetting-mitigation methods. The cross-model consistency and direct comparison to multiple baselines, together with the planned code release, strengthen the potential impact for reproducible research in multimodal continual learning.
major comments (2)
- [Experimental setup and results sections] The eight held-out benchmarks are central to the forgetting measurements and to the claim that selective tuning controls output distribution shift, yet the manuscript provides limited detail on their exact definitions, task formulations, and the criteria used to choose the five target skills versus the held-out set; this leaves open the possibility of selection effects that could affect the generality of the reported deltas.
- [Results section, table reporting deltas] The learning and forgetting deltas (e.g., SA Proj. +24.9 / -0.6) are presented without standard deviations, number of runs, or statistical significance tests, which is load-bearing for confidently asserting that the selective recipes substantially outperform full tuning in the trade-off.
minor comments (2)
- [Methods] The counting-bias probe is introduced to link distribution shift to forgetting, but its exact construction (e.g., prompt templates, counting categories, and how correlation with held-out scores is quantified) could be clarified in the methods for easier replication.
- [Introduction and methods] Notation for layer components (Gate&Up, Down projection) should be defined explicitly on first use with reference to the underlying transformer architecture to avoid ambiguity for readers unfamiliar with the specific LMM implementations.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. The comments identify opportunities to improve clarity on experimental details and statistical reporting. We address each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Experimental setup and results sections] The eight held-out benchmarks are central to the forgetting measurements and to the claim that selective tuning controls output distribution shift, yet the manuscript provides limited detail on their exact definitions, task formulations, and the criteria used to choose the five target skills versus the held-out set; this leaves open the possibility of selection effects that could affect the generality of the reported deltas.
Authors: We agree that expanded details on the benchmarks and selection criteria will strengthen the presentation and address concerns about potential selection effects. In the revised manuscript we will add precise task formulations and definitions for all eight held-out benchmarks in the Experimental Setup section. We will also describe the selection criteria for the five target skills, noting that they were chosen to span distinct capabilities (e.g., visual reasoning, detailed captioning, and compositional VQA) while the held-out set covers a broader range of general multimodal abilities. These additions will better substantiate the generality of the reported deltas. revision: yes
- Referee: [Results section, table reporting deltas] The learning and forgetting deltas (e.g., SA Proj. +24.9 / -0.6) are presented without standard deviations, number of runs, or statistical significance tests, which is load-bearing for confidently asserting that the selective recipes substantially outperform full tuning in the trade-off.
Authors: We acknowledge that explicit reporting of variability and statistical measures would increase confidence in the comparisons. Due to the high computational cost of sequential fine-tuning across three model families, the primary results reflect single runs per configuration. In the revision we will state the number of runs explicitly, report standard deviations from any auxiliary multi-seed checks that were performed, and add a brief discussion of effect sizes and cross-model consistency to support the observed trade-off advantages. Full statistical significance testing across all configurations would require additional compute that is not feasible within the current experimental budget. revision: partial
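The variability reporting discussed above could be as simple as the following sketch. It assumes per-seed deltas are available as plain lists of floats; the function names and the output format are hypothetical, and single-run configurations are flagged rather than given a fabricated spread.

```python
from statistics import mean, stdev


def summarize(deltas):
    """Return (mean, sample standard deviation or None) for per-seed deltas."""
    if not deltas:
        raise ValueError("need at least one run")
    spread = stdev(deltas) if len(deltas) > 1 else None
    return mean(deltas), spread


def format_delta(deltas):
    """Render per-seed deltas as 'mean ± std (n=k)', flagging single runs."""
    avg, spread = summarize(deltas)
    if spread is None:
        return f"{avg:+.1f} (n=1, no spread available)"
    return f"{avg:+.1f} ± {spread:.1f} (n={len(deltas)})"


if __name__ == "__main__":
    # Placeholder per-seed values for illustration only.
    print(format_delta([1.0, 2.0, 3.0]))
    print(format_delta([5.0]))
```

Reporting `n` next to every delta makes it clear which comparisons rest on a single run.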
Circularity Check
No significant circularity in empirical derivation chain
Full rationale
The paper's central claims rest on direct experimental measurements: performance deltas on eight held-out benchmarks and a counting-bias probe across three model families after sequential fine-tuning. Selective updates to self-attention projections or MLP Gate&Up layers are shown to improve the learning-forgetting trade-off via concrete reported values (+24.9/-0.6 and +30.5/-2.1) compared to full tuning (+31.8/-23.3). No equations, fitted parameters, or self-citations are used to define or predict these quantities; the results are obtained from explicit tuning runs and baseline comparisons (LwF, LoRA, MoE, WiSE-FT). The derivation chain is therefore self-contained against external benchmarks with no reduction by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: held-out benchmarks measure general multimodal ability independently of the particular fine-tuning tasks and component choices.
Forward citations
Cited by 2 Pith papers
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.