FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

Chenyu Huang; Jinhan Mu; Li Shen; Peng Ye; Shenghe Zheng; Tao Chen; Xudong Tan

arxiv: 2601.21187 · v2 · submitted 2026-01-29 · 💻 cs.CV · cs.LG

Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen

Pith reviewed 2026-05-16 10:29 UTC · model grok-4.3

keywords: model merging · vision-language models · reasoning injection · SVD subspaces · fine-grained merging · self-distillation · task vectors

The pith

FRISM injects reasoning capabilities into vision-language models by merging at the SVD subspace level instead of whole layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FRISM as a way to combine large reasoning models with vision-language models more precisely than prior merging techniques. It decomposes the reasoning task vectors using singular value decomposition and learns separate scaling factors for each resulting subspace. A label-free self-distillation process on ordinary vision-language datasets then tunes these factors so reasoning improves while visual perception stays largely intact. Experiments show consistent gains on reasoning benchmarks with minimal loss on visual tasks, avoiding the usual trade-offs seen in coarser layer-level merging.

Core claim

FRISM decomposes LRM task vectors via SVD and adaptively tunes the scaling coefficients of each subspace through learning, paired with a label-free self-distillation strategy using common vision-language perception datasets, to achieve fine-grained reasoning injection that improves reasoning performance while largely preserving visual capabilities.

What carries the argument

Subspace-level merging of task vectors obtained by SVD decomposition, with learned per-subspace scaling coefficients.
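The mechanism above can be sketched for a single weight matrix. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the task vector is taken to be the LRM weights minus a shared base model's weights, each rank-1 SVD component is treated as one "subspace", and the names `subspace_merge`, `w_base`, and `alphas` are hypothetical. In the paper the coefficients are learned; here they are simply passed in.

```python
import numpy as np

def subspace_merge(w_vlm, w_lrm, w_base, alphas):
    """Add an SVD-rescaled LRM task vector to one VLM weight matrix."""
    tau = w_lrm - w_base  # task vector (assumed definition: LRM minus base)
    u, s, vt = np.linalg.svd(tau, full_matrices=False)
    # Scale each singular direction i by alphas[i] before recombining.
    scaled_tau = (u * (alphas * s)) @ vt
    return w_vlm + scaled_tau

# Toy matrices standing in for one layer of each model (random, illustrative).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(6, 4))
w_lrm = w_base + 0.1 * rng.normal(size=(6, 4))
w_vlm = w_base + 0.1 * rng.normal(size=(6, 4))

full = subspace_merge(w_vlm, w_lrm, w_base, np.ones(4))   # all subspaces kept
none = subspace_merge(w_vlm, w_lrm, w_base, np.zeros(4))  # all subspaces dropped
```

With every coefficient set to 1 the result reduces to plain task-vector addition, and with every coefficient set to 0 the VLM is returned unchanged. Values in between are where per-subspace control would come from.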

If this is right

Stronger results appear across multiple visual-language reasoning benchmarks compared with layer-level merging baselines.
Visual capabilities remain close to the original VLM because subspace scaling avoids broad interference.
The method works using only unlabeled common vision-language datasets for the distillation step.
Adaptive per-subspace coefficients replace fixed or uniform merging weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same subspace scaling idea could apply to merging other specialized models beyond reasoning and vision-language pairs.
If subspaces align with distinct capabilities, similar decomposition might help in merging models for different modalities or tasks.
Testing on larger or more diverse VLMs would show whether the subspace separation remains stable at scale.

Load-bearing premise

Different SVD subspaces contribute differently to reasoning versus visual perception, so selective scaling can add one without harming the other.

What would settle it

Apply FRISM to a new VLM-LRM pair: finding no gain on reasoning benchmarks, or a clear drop in accuracy on standard visual perception tasks, would count against the claim.

Original abstract

Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that different SVD subspaces contribute differently to reasoning and perception, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities while largely preserving the model's visual capabilities by consistently achieving strong performance across diverse visual-language reasoning benchmarks.
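The abstract names a label-free self-distillation strategy with dual-objective optimization but does not give the loss. The sketch below shows only what such a loss could look like, under explicit assumptions: the distillation term is taken to be a KL divergence from a teacher's output distribution to the merged model's, and `dual_objective` is a hypothetical weighted sum of two loss terms. Which two terms the paper actually combines, and how it weights them, is not stated here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def distill_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) over a batch; uses no ground-truth labels."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def dual_objective(term_a, term_b, weight=0.5):
    """Hypothetical convex combination of two loss terms (assumed form)."""
    return weight * term_a + (1.0 - weight) * term_b

# Illustrative logits only; not taken from any model.
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.0, 1.0, -0.5]])
loss = distill_kl(teacher, student)
```

Because the teacher's own outputs serve as targets, a term like this needs only unlabeled inputs, which is consistent with the "label-free" description.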

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FRISM refines model merging with per-subspace SVD scaling to inject reasoning into VLMs while trying to hold visual performance, but the abstract gives no numbers so the gains stay unproven.

The main new element is the shift from layer-level merging to SVD subspace decomposition on the LRM task vectors, followed by learned scaling coefficients per subspace. This lets the method tune reasoning injection more finely than the coarse approaches it cites, and the observation that subspaces split reasoning versus perception contributions is a plausible basis for avoiding the usual trade-offs. The label-free self-distillation step on ordinary vision-language datasets is also a practical addition that keeps the method from needing extra labeled data while guarding against capability loss. That combination is the part worth looking at if you work on merging or adaptation techniques.

The soft spot is the evidence. The abstract asserts strong benchmark results and preservation of visual capabilities, yet supplies no metrics, baselines, ablations, or error analysis. Without those details it is impossible to tell whether the subspace scaling actually drives the improvement or whether simpler merging plus distillation would have produced similar numbers. The central assumption about differential subspace roles therefore rests on the experiments that are not shown here.

This paper is for people already working on efficient VLM adaptation and model merging. A reader in that area would get a clear method sketch and a reproducible-sounding procedure, even if the claims need tighter validation. I would send it to peer review because the framework is straightforward to implement and test, and the idea directly targets a known limitation in current merging work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FRISM, a subspace-level model merging method to inject reasoning capabilities from Large Reasoning Models (LRMs) into Vision-Language Models (VLMs). It decomposes LRM task vectors via SVD, learns per-subspace scaling coefficients to enable fine-grained injection, and uses a label-free self-distillation objective with dual optimization on vision-language perception datasets to improve reasoning while preserving visual performance.

Significance. If the reported experiments confirm consistent gains across benchmarks without visual degradation, the subspace-level approach would represent a useful refinement over layer-wise merging techniques, offering more targeted control via data-driven scaling. The self-distillation safeguard is a sensible addition for maintaining stability.

major comments (2)

[Abstract] Abstract: the claim of 'consistently achieving strong performance across diverse visual-language reasoning benchmarks' is presented without any quantitative metrics, baseline comparisons, or ablation results, preventing direct evaluation of whether the data support the central claim of improved reasoning with preserved visual capabilities.
[Method] Method section (SVD decomposition and scaling): the premise that different SVD subspaces differentially contribute to reasoning versus perception is load-bearing for the fine-grained injection claim, yet the manuscript provides no explicit equations, ablation tables, or quantitative analysis showing how the learned coefficients exploit this differential (e.g., no reported subspace-wise contribution scores or sensitivity analysis).
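The sensitivity analysis requested in the second comment could take roughly the following form. This is a generic finite-difference sketch, not anything reported in the manuscript: `score_fn` is a hypothetical callable mapping a coefficient vector to a scalar benchmark score, and the toy linear score below is a stand-in chosen only so the example runs.

```python
import numpy as np

def sensitivity(score_fn, alphas, eps=1e-3):
    """Central finite-difference estimate of d(score)/d(alpha_i) for each i."""
    alphas = np.asarray(alphas, dtype=float)
    grads = np.zeros_like(alphas)
    for i in range(alphas.size):
        step = np.zeros_like(alphas)
        step[i] = eps
        grads[i] = (score_fn(alphas + step) - score_fn(alphas - step)) / (2 * eps)
    return grads

# Toy stand-in for a benchmark score (assumption): linear in the coefficients.
toy_weights = np.array([2.0, 0.0, -1.0])

def toy_score(a):
    return float(toy_weights @ a)

result = sensitivity(toy_score, np.ones(3))
```

Running the same routine once with a reasoning score and once with a perception score would give the per-subspace comparison the referee asks for: a subspace with a large reasoning sensitivity and a small perception sensitivity is one that selective scaling can exploit.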

minor comments (2)

[Method] Ensure all scaling coefficients and optimization objectives are defined with explicit notation and pseudocode in the main text rather than relying solely on high-level descriptions.
[Experiments] Figure captions and tables should include error bars or standard deviations for all reported benchmark scores to allow assessment of statistical significance.
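As an illustration of the reporting asked for in the last comment, a benchmark score can be summarized as a mean and sample standard deviation over repeated runs. The run scores below are hypothetical placeholders, not results from the paper, and `summarize` is a name introduced here.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of run scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std, divides by n - 1
    return mean, std

# Hypothetical accuracies from three runs with different seeds.
runs = [61.2, 60.8, 61.9]
mean, std = summarize(runs)
report = f"{mean:.1f} ± {std:.1f}"
```

Reporting the number of runs alongside the spread lets a reader judge whether a gap between two methods is larger than run-to-run noise.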

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of results and analysis.

Referee: [Abstract] Abstract: the claim of 'consistently achieving strong performance across diverse visual-language reasoning benchmarks' is presented without any quantitative metrics, baseline comparisons, or ablation results, preventing direct evaluation of whether the data support the central claim of improved reasoning with preserved visual capabilities.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we will update the abstract to include key metrics (e.g., average reasoning benchmark gains and visual-task retention relative to baselines) while directing readers to the full tables and ablations in Sections 4–5. This change directly addresses the concern without altering the underlying claims.
Revision: yes

Referee: [Method] Method section (SVD decomposition and scaling): the premise that different SVD subspaces differentially contribute to reasoning versus perception is load-bearing for the fine-grained injection claim, yet the manuscript provides no explicit equations, ablation tables, or quantitative analysis showing how the learned coefficients exploit this differential (e.g., no reported subspace-wise contribution scores or sensitivity analysis).

Authors: The differential contribution premise is indeed central. Although the current text states the observation, we acknowledge the absence of supporting equations and quantitative validation. We will insert the explicit SVD decomposition and scaling-coefficient formulation in Section 3, together with an ablation study (including subspace-wise contribution scores and sensitivity plots) in the experiments section to demonstrate how the learned coefficients exploit the differential.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's core procedure decomposes LRM task vectors with standard SVD, then learns per-subspace scaling coefficients via optimization on external perception datasets using a label-free self-distillation objective. Claimed gains are measured on separate visual-language reasoning benchmarks rather than being defined by the fit. No equations reduce any prediction or result to its own inputs by construction, and no load-bearing self-citation or imported uniqueness theorem is invoked. The derivation remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that SVD subspaces have separable contributions to reasoning versus perception and on learned scaling coefficients obtained via self-distillation.

free parameters (1)

scaling coefficients of each subspace
Learned adaptively through the dual-objective self-distillation optimization on vision-language perception datasets.

axioms (1)

domain assumption: Different SVD subspaces contribute differently to reasoning and perception
Presented as the key observation motivating the fine-grained approach.
