DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning
Pith reviewed 2026-05-18 05:51 UTC · model grok-4.3
The pith
A precomputed reasoning prior from parameter differences biases gradients to transfer text-model reasoning into MLLMs during standard fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the vector difference in parameters between a text-only reasoning-enhanced LLM and an MLLM forms a stable reasoning prior; when this prior is used to bias the gradient updates during supervised fine-tuning, the resulting model gains stronger reasoning on multimodal tasks while preserving its original multimodal capabilities, outperforming both naive parameter merging and plain SFT.
What carries the argument
The reasoning prior carries the argument. It is formed by subtracting the parameters of a base multimodal model from those of a text-only reasoning expert, and it then directionally biases gradient updates throughout supervised fine-tuning.
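The subtraction described here can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `reasoning_prior`, the dict-of-lists parameter layout, and the choice to skip tensors that have no same-shaped counterpart in the text-only model (for example vision-encoder weights) are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def reasoning_prior(reason: Params, multimodal: Params) -> Params:
    """Return reason - multimodal for every tensor the two models share."""
    prior: Params = {}
    for name, w_vl in multimodal.items():
        w_r = reason.get(name)
        # Assumption: tensors missing from the text-only expert, or with a
        # different size, get no prior and are simply skipped.
        if w_r is None or len(w_r) != len(w_vl):
            continue
        prior[name] = [r - v for r, v in zip(w_r, w_vl)]
    return prior
```

Because the prior depends only on two fixed checkpoints, it can be computed once and stored before fine-tuning starts, which is what "precomputed" refers to.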
If this is right
- DRIFT matches or exceeds the accuracy of training-intensive reasoning methods while using substantially less data and compute.
- The method remains stable where naive merging degrades performance on certain model families such as Qwen-based MLLMs.
- Standard SFT pipelines can incorporate the prior without changing their overall structure or requiring new data collection.
- Reasoning improvements appear on benchmarks including MathVista and MathVerse without separate reinforcement-learning stages.
Where Pith is reading between the lines
- The same prior-biasing technique might transfer other capabilities such as coding skill or safety constraints if suitable expert models exist.
- Iterating the process by updating the prior from successively stronger reasoning models could compound gains over multiple rounds.
- The approach may lower the barrier for adapting new multimodal architectures by reducing the volume of task-specific data needed.
Load-bearing premise
The parameter differences between text-only reasoning models and multimodal models encode a reliable prior that can steer gradients without harming multimodal performance.
What would settle it
Running the same fine-tuning schedule on MathVista or MathVerse with and without the gradient bias and finding no gain, or finding degradation on multimodal alignment metrics.
Original abstract
Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement learning, incurring substantial cost. An appealing alternative is parameter-space model merging between reasoning-enhanced LLMs and MLLMs, but we show that naive merging is fragile: its effectiveness varies widely across model families and can significantly degrade performance (e.g., for Qwen-based MLLMs). We propose Directional Reasoning Injection for Fine-Tuning (DRIFT), a lightweight method that transfers reasoning knowledge in the gradient space while preserving multimodal alignment. DRIFT precomputes a reasoning prior from the parameter differences between text-only reasoning experts and multimodal models, and uses it to bias gradients during supervised fine-tuning. This design retains the simplicity of standard SFT pipelines while enabling efficient and stable reasoning transfer. Experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, show that DRIFT consistently outperforms naive merging and standard SFT, and matches or surpasses training-intensive methods with substantially lower data and compute.
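For contrast with the gradient-space approach, the parameter-space baseline the abstract calls naive merging can be pictured as follows. The abstract does not say which merging rule was used, so the linear interpolation below, the name `naive_merge`, and the weight `lam` are assumptions chosen only to illustrate the idea.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def naive_merge(multimodal: Params, reason: Params, lam: float) -> Params:
    """Interpolate shared tensors: (1 - lam) * multimodal + lam * reason."""
    merged: Params = {}
    for name, w_vl in multimodal.items():
        w_r = reason.get(name)
        if w_r is None or len(w_r) != len(w_vl):
            # Assumption: tensors without a text-only counterpart are kept as-is.
            merged[name] = list(w_vl)
            continue
        merged[name] = [(1.0 - lam) * v + lam * r for v, r in zip(w_vl, w_r)]
    return merged
```

Merging of this kind edits the weights directly in one step with no training signal, which is one plausible reading of why the abstract describes it as fragile across model families.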
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DRIFT (Directional Reasoning Injection for Fine-Tuning), a lightweight method that precomputes a reasoning prior as the parameter difference between a text-only reasoning expert and a base MLLM, then adds this direction to bias gradients during standard supervised fine-tuning. It claims that naive parameter-space merging is fragile (especially on Qwen-based models) and that DRIFT achieves consistent gains over both naive merging and vanilla SFT on MathVista and MathVerse while matching or exceeding training-intensive baselines at far lower data and compute cost.
Significance. If the empirical claims are robustly supported, DRIFT offers a practical, low-cost route to inject text-only reasoning capabilities into MLLMs without large-scale multimodal reasoning corpora or RL, directly addressing the fragility of merging across model families.
major comments (2)
- [Method and Experiments] The central claim that the precomputed delta isolates transferable reasoning knowledge (rather than modality-specific or optimization artifacts) is load-bearing for the stability and outperformance assertions, yet the manuscript provides no explicit held-out evaluation of general multimodal alignment (e.g., VQA or captioning performance) after gradient biasing with the same delta.
- [Experiments] The abstract and results sections report consistent outperformance versus naive merging and SFT, but supply no details on statistical significance testing, number of random seeds, or variance across runs; this weakens support for the “consistently outperforms” claim given the free parameter for bias strength.
minor comments (2)
- [Introduction] The motivation paragraph would be strengthened by a short quantitative illustration of the performance drop under naive merging on Qwen-based models.
- [Method] Add an explicit equation or pseudocode block showing how the precomputed prior is added to the gradient update; current prose description leaves the exact injection mechanism under-specified for reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where we will revise the paper to strengthen the empirical support.
- Referee: [Method and Experiments] The central claim that the precomputed delta isolates transferable reasoning knowledge (rather than modality-specific or optimization artifacts) is load-bearing for the stability and outperformance assertions, yet the manuscript provides no explicit held-out evaluation of general multimodal alignment (e.g., VQA or captioning performance) after gradient biasing with the same delta.
  Authors: We agree that an explicit evaluation on held-out general multimodal tasks would provide stronger evidence that the reasoning prior primarily captures transferable reasoning capabilities rather than modality-specific or optimization artifacts. While DRIFT is designed to apply the directional bias only during supervised fine-tuning on multimodal data (thereby preserving the original alignment), we will add results on standard VQA and captioning benchmarks in the revised manuscript, comparing DRIFT against vanilla SFT to demonstrate that general multimodal performance is retained or improved.
  Revision: yes
- Referee: [Experiments] The abstract and results sections report consistent outperformance versus naive merging and SFT, but supply no details on statistical significance testing, number of random seeds, or variance across runs; this weakens support for the “consistently outperforms” claim given the free parameter for bias strength.
  Authors: We acknowledge that additional details on run-to-run variance and statistical testing would better support the robustness of our claims, particularly given the bias strength hyperparameter. In the revised manuscript we will report results averaged over at least three random seeds with standard deviations, include statistical significance tests (e.g., paired t-tests) against the baselines, and provide a brief sensitivity analysis for the bias strength to clarify its selection and impact.
  Revision: yes
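The reporting promised in the second response (per-seed mean and standard deviation, plus a paired t-test against a baseline) could look roughly like the sketch below. The helper names are hypothetical and the sketch uses only the Python standard library; it returns the t statistic and degrees of freedom and leaves the p-value lookup to a t table or a statistics package.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(
    method: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom, pairing scores by seed."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two seed-matched score pairs")
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all differences identical: t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With only three seeds the test has two degrees of freedom, so even large t values correspond to wide uncertainty; the sensitivity analysis over bias strength would need the same per-seed treatment.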
Circularity Check
DRIFT derives reasoning prior from parameter deltas and applies it as gradient bias in standard SFT without definitional reduction or self-referential fitting
The paper's core mechanism precomputes a directional prior as the difference between a text-only reasoning expert and the base MLLM, then injects this fixed vector to bias gradients during supervised fine-tuning. This construction is additive and external to the fine-tuning objective itself; no equation equates the final performance metric to the prior by algebraic identity, nor is any parameter fitted on the target multimodal data and then relabeled as a prediction. The claim of outperforming naive merging and standard SFT rests on reported benchmark results rather than on any uniqueness theorem or ansatz imported from the authors' prior work. No load-bearing step reduces to a self-citation chain or renames an existing empirical pattern under new coordinates. The derivation therefore remains self-contained against external benchmarks.
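The injection step can be sketched from the update rule quoted from the paper elsewhere on this page, g̃ = g + α·scale(g, Δ). The excerpt does not define scale, so the version below assumes it rescales Δ to the L2 norm of the current gradient; that choice, the function names, and the zero-norm guard are illustrative assumptions. Whether the added term moves the weights toward or away from the reasoning expert depends on the optimizer's sign convention and on the paper's actual definition of scale, neither of which the excerpt specifies.

```python
import math
from typing import List


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def biased_gradient(
    g: List[float], delta: List[float], alpha: float, eps: float = 1e-12
) -> List[float]:
    """Return g + alpha * scale(g, delta) for one flattened tensor.

    Assumed scale: delta rescaled so its norm matches the norm of g.
    """
    d_norm = l2_norm(delta)
    if d_norm < eps:
        # No usable direction for this tensor: leave the gradient unchanged.
        return list(g)
    factor = alpha * l2_norm(g) / d_norm
    return [gi + factor * di for gi, di in zip(g, delta)]
```

Under this assumed scaling, α is the single free parameter listed in the ledger: α = 0 recovers plain SFT, and larger values give the fixed prior more weight relative to the data gradient.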
Axiom & Free-Parameter Ledger
free parameters (1)
- reasoning prior bias strength
axioms (1)
- domain assumption: Parameter differences between text-only reasoning experts and multimodal models capture a generalizable reasoning prior.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  "DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  "We define the difference between a reasoning model and a multimodal variant: Δ = ϕ_reason − ϕ_VL ... g̃ = g + α·scale(g, Δ)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.