FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models
Pith reviewed 2026-05-16 10:29 UTC · model grok-4.3
The pith
FRISM injects reasoning capabilities into vision-language models by merging at the SVD subspace level instead of whole layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FRISM decomposes LRM task vectors via SVD and adaptively tunes the scaling coefficients of each subspace through learning, paired with a label-free self-distillation strategy using common vision-language perception datasets, to achieve fine-grained reasoning injection that improves reasoning performance while largely preserving visual capabilities.
What carries the argument
Subspace-level merging of task vectors obtained by SVD decomposition, with learned per-subspace scaling coefficients.
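The mechanism can be sketched for a single weight matrix. This is a minimal illustration, not the paper's implementation: it assumes the task vector is the LRM weight minus a shared base-model weight, and the names `subspace_merge`, `w_base`, `w_vlm`, `w_lrm`, and the coefficient values are hypothetical. In FRISM the coefficients are learned rather than fixed by hand.

```python
import numpy as np

def subspace_merge(w_vlm, w_lrm, w_base, coeffs):
    """Merge one weight matrix by scaling each SVD subspace of the task vector."""
    # Assumption: task vector = reasoning-model weight minus shared base weight.
    tau = w_lrm - w_base
    u, s, vt = np.linalg.svd(tau, full_matrices=False)
    # Scale singular value i by coeffs[i], then sum the rank-one components.
    scaled_tau = (u * (coeffs * s)) @ vt
    return w_vlm + scaled_tau

rng = np.random.default_rng(0)
w_base = rng.normal(size=(6, 4))                 # toy stand-in weights
w_vlm = w_base + 0.1 * rng.normal(size=(6, 4))
w_lrm = w_base + 0.1 * rng.normal(size=(6, 4))

coeffs = np.array([1.0, 0.5, 0.0, 0.0])          # hypothetical; FRISM learns these
merged = subspace_merge(w_vlm, w_lrm, w_base, coeffs)
```

Setting every coefficient to zero returns the VLM weight unchanged, and setting every coefficient to one adds the full task vector, so the coefficients interpolate per subspace between those two extremes.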
If this is right
- Stronger results appear across multiple visual-language reasoning benchmarks compared with layer-level merging baselines.
- Visual capabilities remain close to those of the original VLM because subspace scaling avoids broad interference.
- The method works using only unlabeled common vision-language datasets for the distillation step.
- Adaptive per-subspace coefficients replace fixed or uniform merging weights.
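The last point can be made concrete with a toy check: a single layer-level weight is the special case in which every subspace coefficient is equal, while unequal coefficients produce an update that no single scalar reproduces. The sketch below uses a random stand-in task vector and hypothetical values (`lam`, the coefficient array); it illustrates the linear algebra only and makes no claim about the paper's baselines.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = rng.normal(size=(5, 3))                    # stand-in task vector for one layer
u, s, vt = np.linalg.svd(tau, full_matrices=False)

lam = 0.7                                        # hypothetical layer-level weight
uniform = (u * (np.full(s.shape, lam) * s)) @ vt
layer_level = lam * tau
same = np.allclose(uniform, layer_level)         # uniform coefficients == scalar scaling

# Unequal coefficients: fit the best single scalar and measure what is left over.
selective = (u * (np.array([1.0, 0.2, 0.0]) * s)) @ vt
best_scalar = np.sum(selective * tau) / np.sum(tau * tau)   # least-squares fit
residual = np.linalg.norm(selective - best_scalar * tau)
```

A nonzero `residual` shows that the selective update lies outside the one-dimensional family reachable by layer-level scaling.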
Where Pith is reading between the lines
- The same subspace scaling idea could apply to merging other specialized models beyond reasoning and vision-language pairs.
- If subspaces align with distinct capabilities, similar decomposition might help in merging models for different modalities or tasks.
- Testing on larger or more diverse VLMs would show whether the subspace separation remains stable at scale.
Load-bearing premise
Different SVD subspaces contribute differently to reasoning versus visual perception, so selective scaling can add one without harming the other.
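One way such a premise could be probed is to score each subspace by how strongly it responds to inputs from each domain. The following is a hedged sketch of that idea, not the analysis used in the paper: `subspace_scores` is a hypothetical helper, and the two probe batches are random placeholders standing in for activations collected on reasoning and perception data.

```python
import numpy as np

def subspace_scores(tau, probe_inputs):
    """Output norm each SVD subspace of tau contributes on a batch of inputs.

    tau: (d_out, d_in) task vector; probe_inputs: (d_in, n), one column per input.
    """
    u, s, vt = np.linalg.svd(tau, full_matrices=False)
    # Component i maps x to s[i] * u[:, i] * (vt[i] @ x); u[:, i] has unit norm,
    # so its output norm over the batch is s[i] * ||vt[i] @ X||.
    return s * np.linalg.norm(vt @ probe_inputs, axis=1)

rng = np.random.default_rng(2)
tau = rng.normal(size=(8, 6))
reasoning_x = rng.normal(size=(6, 32))           # placeholder activations, not real data
perception_x = rng.normal(size=(6, 32))          # placeholder activations, not real data

r = subspace_scores(tau, reasoning_x)
p = subspace_scores(tau, perception_x)
ratio = r / (r + p)                              # near 1: mostly active on reasoning inputs
```

With real activations, ratios clustered near 0.5 for every subspace would suggest little separation between the two capabilities, whereas a wide spread would be consistent with the premise.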
What would settle it
Applying FRISM to a new VLM-LRM pair and finding either no gain on reasoning benchmarks or a clear drop in accuracy on standard visual perception tasks would count against the claim.
Original abstract
Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that different SVD subspaces contribute differently to reasoning and perception, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities while largely preserving the model's visual capabilities by consistently achieving strong performance across diverse visual-language reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FRISM, a subspace-level model merging method to inject reasoning capabilities from Large Reasoning Models (LRMs) into Vision-Language Models (VLMs). It decomposes LRM task vectors via SVD, learns per-subspace scaling coefficients to enable fine-grained injection, and uses a label-free self-distillation objective with dual optimization on vision-language perception datasets to improve reasoning while preserving visual performance.
Significance. If the reported experiments confirm consistent gains across benchmarks without visual degradation, the subspace-level approach would represent a useful refinement over layer-wise merging techniques, offering more targeted control via data-driven scaling. The self-distillation safeguard is a sensible addition for maintaining stability.
major comments (2)
- [Abstract] Abstract: the claim of 'consistently achieving strong performance across diverse visual-language reasoning benchmarks' is presented without any quantitative metrics, baseline comparisons, or ablation results, preventing direct evaluation of whether the data support the central claim of improved reasoning with preserved visual capabilities.
- [Method] Method section (SVD decomposition and scaling): the premise that different SVD subspaces differentially contribute to reasoning versus perception is load-bearing for the fine-grained injection claim, yet the manuscript provides no explicit equations, ablation tables, or quantitative analysis showing how the learned coefficients exploit this differential (e.g., no reported subspace-wise contribution scores or sensitivity analysis).
minor comments (2)
- [Method] Ensure all scaling coefficients and optimization objectives are defined with explicit notation and pseudocode in the main text rather than relying solely on high-level descriptions.
- [Experiments] Figure captions and tables should include error bars or standard deviations for all reported benchmark scores to allow assessment of statistical significance.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of results and analysis.
Point-by-point responses
- Referee: [Abstract] Abstract: the claim of 'consistently achieving strong performance across diverse visual-language reasoning benchmarks' is presented without any quantitative metrics, baseline comparisons, or ablation results, preventing direct evaluation of whether the data support the central claim of improved reasoning with preserved visual capabilities.
  Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we will update the abstract to include key metrics (e.g., average reasoning benchmark gains and visual-task retention relative to baselines) while directing readers to the full tables and ablations in Sections 4–5. This change directly addresses the concern without altering the underlying claims.
  Revision: yes
- Referee: [Method] Method section (SVD decomposition and scaling): the premise that different SVD subspaces differentially contribute to reasoning versus perception is load-bearing for the fine-grained injection claim, yet the manuscript provides no explicit equations, ablation tables, or quantitative analysis showing how the learned coefficients exploit this differential (e.g., no reported subspace-wise contribution scores or sensitivity analysis).
  Authors: The differential contribution premise is indeed central. Although the current text states the observation, we acknowledge the absence of supporting equations and quantitative validation. We will insert the explicit SVD decomposition and scaling-coefficient formulation in Section 3, together with an ablation study (including subspace-wise contribution scores and sensitivity plots) in the experiments section to demonstrate how the learned coefficients exploit the differential.
  Revision: yes
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper's core procedure decomposes LRM task vectors with standard SVD, then learns per-subspace scaling coefficients via optimization on external perception datasets using a label-free self-distillation objective. Claimed gains are measured on separate visual-language reasoning benchmarks rather than being defined by the fit. No equations reduce any prediction or result to its own inputs by construction, and no load-bearing self-citation or imported uniqueness theorem is invoked. The derivation remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling coefficients of each subspace
axioms (1)
- domain assumption: Different SVD subspaces contribute differently to reasoning and perception