arxiv: 2510.15050 · v2 · submitted 2025-10-16 · 💻 cs.CV

DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning

Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu

Pith reviewed 2026-05-18 05:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language models, reasoning transfer, gradient biasing, supervised fine-tuning, parameter differences, efficient adaptation, model merging

The pith

A precomputed reasoning prior from parameter differences biases gradients to transfer text-model reasoning into MLLMs during standard fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that reasoning gaps in multimodal models can be closed without large new datasets or reinforcement learning by extracting a directional prior from how text-only reasoning models differ in parameters from multimodal ones. This prior then steers the gradient steps of ordinary supervised fine-tuning so that reasoning improves while multimodal alignment stays intact. A reader would care because the approach keeps training pipelines simple yet delivers performance that rivals far more expensive methods across standard multimodal reasoning benchmarks.

Core claim

The central claim is that the vector difference in parameters between a text-only reasoning-enhanced LLM and an MLLM forms a stable reasoning prior; when this prior is used to bias the gradient updates during supervised fine-tuning, the resulting model gains stronger reasoning on multimodal tasks while preserving its original multimodal capabilities, outperforming both naive parameter merging and plain SFT.

What carries the argument

The reasoning prior, formed by subtracting the parameters of a base multimodal model from those of a text-only reasoning expert, which then directionally biases gradient updates throughout supervised fine-tuning.
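
As a rough illustration of the subtraction described above, the prior can be sketched as an element-wise difference over the parameters the two models share. This is a minimal sketch under stated assumptions, not the authors' implementation: the flat `name -> list of floats` layout, the parameter names, and the rule of skipping tensors that the text-only expert lacks (such as a vision encoder) are all hypothetical choices made for the example.

```python
from typing import Dict, List

# Hypothetical flat layout for illustration: parameter name -> weights.
Params = Dict[str, List[float]]


def reasoning_prior(phi_reason: Params, phi_vl: Params) -> Params:
    """Return delta = phi_reason - phi_vl over parameters both models share."""
    delta: Params = {}
    for name, vl_weights in phi_vl.items():
        reason_weights = phi_reason.get(name)
        # Assumption: skip tensors the text-only expert does not have
        # (e.g. a vision encoder) or whose sizes do not match.
        if reason_weights is None or len(reason_weights) != len(vl_weights):
            continue
        delta[name] = [r - v for r, v in zip(reason_weights, vl_weights)]
    return delta


# Toy values chosen only to make the arithmetic easy to follow.
phi_vl = {"llm.layer0.w": [1.0, 2.0], "vision.patch_embed.w": [0.5]}
phi_reason = {"llm.layer0.w": [1.5, 1.0]}
print(reasoning_prior(phi_reason, phi_vl))  # {'llm.layer0.w': [0.5, -1.0]}
```

Because the prior is computed once from fixed checkpoints, it can be stored and reused across fine-tuning runs, which is what "precomputed" means here.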

If this is right

DRIFT matches or exceeds the accuracy of training-intensive reasoning methods while using substantially less data and compute.
The method remains stable where naive merging degrades performance on certain model families such as Qwen-based MLLMs.
Standard SFT pipelines can incorporate the prior without changing their overall structure or requiring new data collection.
Reasoning improvements appear on benchmarks including MathVista and MathVerse without separate reinforcement-learning stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-biasing technique might transfer other capabilities such as coding skill or safety constraints if suitable expert models exist.
Iterating the process by updating the prior from successively stronger reasoning models could compound gains over multiple rounds.
The approach may lower the barrier for adapting new multimodal architectures by reducing the volume of task-specific data needed.

Load-bearing premise

The parameter differences between text-only reasoning models and multimodal models encode a reliable prior that can steer gradients without harming multimodal performance.

What would settle it

Running the same fine-tuning schedule on MathVista or MathVerse with and without the gradient bias and finding no gain, or finding degradation on multimodal alignment metrics.

Original abstract

Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement learning, incurring substantial cost. An appealing alternative is parameter-space model merging between reasoning-enhanced LLMs and MLLMs, but we show that naive merging is fragile: its effectiveness varies widely across model families and can significantly degrade performance (e.g., for Qwen-based MLLMs). We propose Directional Reasoning Injection for Fine-Tuning (DRIFT), a lightweight method that transfers reasoning knowledge in the gradient space while preserving multimodal alignment. DRIFT precomputes a reasoning prior from the parameter differences between text-only reasoning experts and multimodal models, and uses it to bias gradients during supervised fine-tuning. This design retains the simplicity of standard SFT pipelines while enabling efficient and stable reasoning transfer. Experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, show that DRIFT consistently outperforms naive merging and standard SFT, and matches or surpasses training-intensive methods with substantially lower data and compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DRIFT biases SFT gradients with a precomputed parameter delta from text reasoning models to improve MLLM math performance over naive merging, but the delta's purity needs checking.

DRIFT's main contribution is a lightweight way to inject reasoning priors into MLLMs. They calculate the difference in parameters between a strong text reasoning model and the multimodal base, then add that vector as a bias to the gradients in supervised fine-tuning. This lets them improve performance on math reasoning benchmarks without needing massive new datasets or RL.

The paper does a good job showing that naive parameter merging is unreliable across different model families. DRIFT sidesteps that by operating in gradient space, which seems to keep the multimodal capabilities stable. On MathVista and MathVerse, it beats standard SFT and matches heavier methods with far less compute and data. That's a practical win for anyone trying to boost reasoning efficiently.

The soft spot is in validating that the prior is clean. The parameter delta might mix in non-reasoning elements like modality mismatches or optimization quirks. If so, biasing the gradients could quietly hurt performance on general visual tasks even as math scores rise. The abstract flags problems with merging on Qwen models but doesn't show explicit tests for the gradient approach on diverse tasks or other families. More details on ablations for the bias strength and controls for alignment preservation would help.

Overall the approach is grounded in existing techniques but applies them in a new combination. No major issues with the logic or citations from what's described. This paper targets practitioners and researchers working on MLLM adaptation for reasoning-heavy applications like visual math or analysis. A reader in that space would get ideas for low-cost improvements. It deserves peer review so the community can examine the experimental setup and confirm the gains are reliable.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DRIFT (Directional Reasoning Injection for Fine-Tuning), a lightweight method that precomputes a reasoning prior as the parameter difference between a text-only reasoning expert and a base MLLM, then adds this direction to bias gradients during standard supervised fine-tuning. It claims that naive parameter-space merging is fragile (especially on Qwen-based models) and that DRIFT achieves consistent gains over both naive merging and vanilla SFT on MathVista and MathVerse while matching or exceeding training-intensive baselines at far lower data and compute cost.

Significance. If the empirical claims are robustly supported, DRIFT offers a practical, low-cost route to inject text-only reasoning capabilities into MLLMs without large-scale multimodal reasoning corpora or RL, directly addressing the fragility of merging across model families.

major comments (2)

[Method and Experiments] The central claim that the precomputed delta isolates transferable reasoning knowledge (rather than modality-specific or optimization artifacts) is load-bearing for the stability and outperformance assertions, yet the manuscript provides no explicit held-out evaluation of general multimodal alignment (e.g., VQA or captioning performance) after gradient biasing with the same delta.
[Experiments] The abstract and results sections report consistent outperformance versus naive merging and SFT, but supply no details on statistical significance testing, number of random seeds, or variance across runs; this weakens support for the “consistently outperforms” claim given the free parameter for bias strength.

minor comments (2)

[Introduction] The motivation paragraph would be strengthened by a short quantitative illustration of the performance drop under naive merging on Qwen-based models.
[Method] Add an explicit equation or pseudocode block showing how the precomputed prior is added to the gradient update; current prose description leaves the exact injection mechanism under-specified for reproduction.
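
For orientation, the injection step that the last minor comment asks to see spelled out might look like the sketch below. It evaluates the update quoted elsewhere on this page, g̃ = g + α·scale(g, Δ), and nothing more. The source does not define `scale`, so the version here (rescaling Δ to the L2 norm of the current gradient) is a hypothetical stand-in, as are the function names and the flat-list representation. Whether the bias should be added or subtracted relative to the optimizer's descent direction is also not settled by the quoted formula alone; the sketch follows the formula as written.

```python
import math
from typing import List


def l2_norm(v: List[float]) -> float:
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))


def scale_to_gradient(g: List[float], delta: List[float],
                      eps: float = 1e-12) -> List[float]:
    """Hypothetical scale(g, delta): rescale delta to the L2 norm of g.

    The paper's actual scaling rule is not given in the source; this is
    an assumption made only so the sketch is concrete.
    """
    factor = l2_norm(g) / (l2_norm(delta) + eps)
    return [factor * d for d in delta]


def biased_gradient(g: List[float], delta: List[float],
                    alpha: float) -> List[float]:
    """Evaluate g_tilde = g + alpha * scale(g, delta) element-wise."""
    scaled = scale_to_gradient(g, delta)
    return [gi + alpha * si for gi, si in zip(g, scaled)]


# Toy values: |g| = 5 and |delta| = 2, so delta is rescaled to about [0, 5].
g = [3.0, 4.0]
delta = [0.0, 2.0]
print(biased_gradient(g, delta, alpha=0.1))  # approximately [3.0, 4.5]
```

Setting `alpha` to zero recovers the plain SFT gradient, which is one way to read the claim that the method keeps the standard pipeline intact.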

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where we will revise the paper to strengthen the empirical support.

Referee: [Method and Experiments] The central claim that the precomputed delta isolates transferable reasoning knowledge (rather than modality-specific or optimization artifacts) is load-bearing for the stability and outperformance assertions, yet the manuscript provides no explicit held-out evaluation of general multimodal alignment (e.g., VQA or captioning performance) after gradient biasing with the same delta.

Authors: We agree that an explicit evaluation on held-out general multimodal tasks would provide stronger evidence that the reasoning prior primarily captures transferable reasoning capabilities rather than modality-specific or optimization artifacts. While DRIFT is designed to apply the directional bias only during supervised fine-tuning on multimodal data (thereby preserving the original alignment), we will add results on standard VQA and captioning benchmarks in the revised manuscript, comparing DRIFT against vanilla SFT to demonstrate that general multimodal performance is retained or improved. revision: yes
Referee: [Experiments] The abstract and results sections report consistent outperformance versus naive merging and SFT, but supply no details on statistical significance testing, number of random seeds, or variance across runs; this weakens support for the “consistently outperforms” claim given the free parameter for bias strength.

Authors: We acknowledge that additional details on run-to-run variance and statistical testing would better support the robustness of our claims, particularly given the bias strength hyperparameter. In the revised manuscript we will report results averaged over at least three random seeds with standard deviations, include statistical significance tests (e.g., paired t-tests) against the baselines, and provide a brief sensitivity analysis for the bias strength to clarify its selection and impact. revision: yes
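
To make the promised paired comparison concrete, here is a minimal sketch of a paired t-statistic over per-seed scores, using only the Python standard library. The function name is hypothetical and the numbers in the usage lines are arbitrary placeholders, not results from the paper. Turning the statistic into a p-value requires a t-distribution with n − 1 degrees of freedom, which is left to a statistics package.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t-statistic for matched runs (same seed in a and b)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Placeholder per-seed accuracies, invented purely for illustration.
method_scores = [0.5, 0.7, 0.9]
baseline_scores = [0.4, 0.5, 0.6]
t = paired_t_statistic(method_scores, baseline_scores)
print(f"t = {t:.3f} with {len(method_scores) - 1} degrees of freedom")
```

With only three seeds the test has two degrees of freedom, so even a visibly consistent gap may not reach conventional significance; that is part of why the seed count matters to the referee's point.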

Circularity Check

0 steps flagged

DRIFT derives reasoning prior from parameter deltas and applies it as gradient bias in standard SFT without definitional reduction or self-referential fitting

The paper's core mechanism precomputes a directional prior as the difference between a text-only reasoning expert and the base MLLM, then injects this fixed vector to bias gradients during supervised fine-tuning. This construction is additive and external to the fine-tuning objective itself; no equation equates the final performance metric to the prior by algebraic identity, nor is any parameter fitted on the target multimodal data and then relabeled as a prediction. The claim of outperforming naive merging and standard SFT rests on reported benchmark results rather than on any uniqueness theorem or ansatz imported from the authors' prior work. No load-bearing step reduces to a self-citation chain or renames an existing empirical pattern under new coordinates. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that parameter differences encode transferable reasoning knowledge usable for gradient biasing. No explicit free parameters or invented entities are named in the abstract; any scaling of the bias direction would constitute an unstated hyperparameter.

free parameters (1)

reasoning prior bias strength
The magnitude of the directional bias added to gradients during fine-tuning is likely controlled by a scaling hyperparameter whose value is not specified in the abstract.

axioms (1)

domain assumption: Parameter differences between text-only reasoning experts and multimodal models capture a generalizable reasoning prior.
Invoked when the abstract states that DRIFT precomputes a reasoning prior from these differences and uses it to bias gradients while preserving alignment.

pith-pipeline@v0.9.0 · 5756 in / 1355 out tokens · 44448 ms · 2026-05-18T05:51:17.601961+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We define the difference between a reasoning model and a multimodal variant: $\Delta = \phi_{\text{reason}} - \phi_{\text{VL}}$ ... $\tilde{g} = g + \alpha \cdot \mathrm{scale}(g, \Delta)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.