RL makes MLLMs see better than SFT

Byeongho Heo; Dongyoon Han; Jaegul Choo; Junha Song; Sangdoo Yun

arxiv: 2510.16333 · v2 · submitted 2025-10-18 · 💻 cs.CV · cs.LG

Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo

Pith reviewed 2026-05-18 05:39 UTC · model grok-4.3

keywords multimodal language models · vision encoder · reinforcement learning · supervised fine-tuning · PIVOT · visual representations · efficient training

The pith

Reinforcement learning produces stronger and more localized visual representations than supervised fine-tuning in MLLM vision encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the common view that MLLM performance mainly comes from the large language model by focusing on the vision encoder instead. Experiments across classification, segmentation, and gradient analysis show that switching from SFT to RL after the main training changes how the encoder processes images, yielding more precise and localized features. This observation is turned into a lightweight training recipe called PIVOT that delivers competitive or better MLLM results while using far less compute than conventional vision pretraining.

Core claim

The key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining.

What carries the argument

Preference-Instructed Vision OpTimization (PIVOT), the recipe that converts the observed RL advantage into an optimization procedure for training vision encoders with preference signals to improve localization.
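To make the contrast between the two post-training objectives concrete, the sketch below places a per-example SFT loss next to the standard DPO preference loss. It assumes DPO as the preference objective, since the passages quoted on this page mention DPO gradient signals; the exact PIVOT loss is not given here. The function names and the default `beta` are illustrative assumptions.

```python
import math


def sft_loss(logp_target: float) -> float:
    """SFT objective for one example: negative log-likelihood of the reference answer."""
    return -logp_target


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is a summed sequence log-probability. The `ref_*` values
    come from a frozen reference model; `beta` scales the implicit reward.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; it falls below that only when the policy favors the chosen response more than the reference does. SFT, by contrast, only raises the likelihood of a single target answer and carries no signal about a rejected alternative.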

If this is right

RL post-training yields better results than SFT on vision-heavy VQA benchmarks in MLLMs.
RL reshapes the vision encoder's internal representations in ways that SFT does not.
PIVOT produces vision encoders that can replace larger, more expensive pretrained models inside MLLMs.
Strong vision backbones for MLLMs can be obtained at under one percent of standard pretraining compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RL preference approach might improve localization performance in vision-only models used for detection or segmentation.
Post-training objectives could prove more influential for perceptual quality than further scaling of the vision backbone itself.
PIVOT-style methods might be tested on other multimodal architectures to check whether the efficiency advantage generalizes.

Load-bearing premise

The superiority of RL over SFT on vision benchmarks and internal representations is due mainly to the post-training objective rather than differences in model scale, data, or optimization details.

What would settle it

A controlled comparison in which the identical vision encoder is trained with RL and with SFT using exactly the same data, scale, and hyperparameters, then evaluated on localization metrics such as segmentation accuracy or gradient-based attention maps, would falsify the claim if RL shows no advantage.
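One way to enforce such a comparison is to check, before any training starts, that the two run configurations differ in the objective and nothing else. The sketch below is a minimal illustration; the configuration keys and values are hypothetical and not taken from the paper.

```python
_MISSING = object()  # sentinel for keys present in only one config


def differing_keys(config_a: dict, config_b: dict) -> set:
    """Return the keys whose values differ between two run configurations."""
    keys = set(config_a) | set(config_b)
    return {
        key
        for key in keys
        if config_a.get(key, _MISSING) != config_b.get(key, _MISSING)
    }


def is_controlled_comparison(config_a: dict, config_b: dict, vary=("objective",)) -> bool:
    """True when the two runs differ in exactly the keys listed in `vary`."""
    return differing_keys(config_a, config_b) == set(vary)


# Hypothetical run configurations; every value here is a placeholder.
SFT_RUN = {
    "vision_encoder": "example-vit",
    "dataset": "shared-mix",
    "train_steps": 1000,
    "learning_rate": 1e-5,
    "objective": "sft",
}
RL_RUN = dict(SFT_RUN, objective="dpo")

print(is_controlled_comparison(SFT_RUN, RL_RUN))  # True
```

Any additional difference, such as a changed learning rate, makes the check fail, which is the confound the load-bearing premise has to rule out.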

Original abstract

A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight, namely the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the effects of post-training strategies on the vision encoders of Multimodal Large Language Models (MLLMs), comparing Supervised Fine-Tuning (SFT) to Reinforcement Learning (RL). It reports that RL yields advantages on vision-heavy VQA benchmarks and produces stronger, more precisely localized visual representations, as shown via ImageNet classification, segmentation tasks, and gradient visualizations. The authors introduce Preference-Instructed Vision OpTimization (PIVOT) as an efficient recipe for training vision encoders that, when plugged into MLLMs, outperform larger and more heavily pre-trained models at under 1% of standard vision pretraining compute.

Significance. If the attribution to the training objective holds after controlling for confounds, the work would meaningfully advance understanding of vision encoders in MLLMs and offer a low-cost alternative to conventional pretraining. The multi-faceted empirical evaluation (VQA, ImageNet, segmentation, visualizations) and the concrete efficiency claim for PIVOT are strengths that could influence future MLLM design if reproducibility and fairness of comparisons are established.

major comments (2)

[Experimental setup and results sections] Experimental comparisons (Sections describing RL vs. SFT setups): the manuscript does not report matched total training compute, data composition, or optimization hyperparameters between the RL and SFT conditions. Because the central claim attributes improved localization and downstream vision performance primarily to the objective rather than these factors, explicit controls or ablation tables isolating the objective are required to support the attribution.
[PIVOT recipe and integration experiments] PIVOT integration results (the section reporting MLLM performance with PIVOT-trained encoders): the claim that a PIVOT encoder outperforms larger, more heavily-trained counterparts at <1% compute cost needs a precise accounting of the baseline compute (e.g., which vision pretraining runs are used for the percentage) and confirmation that the MLLM backbone and other training stages remain identical.

minor comments (2)

[Abstract] The abstract states 'less than 1% of the computational cost of standard vision pretraining' without defining the exact baseline or FLOPs/epoch measurement; add a short footnote or table entry clarifying the reference.
[Visualization experiments] Gradient visualization figures would benefit from quantitative metrics (e.g., localization error or saliency overlap scores) alongside the qualitative examples to strengthen the 'precisely localized' claim.
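The saliency overlap scores suggested in the last comment could take a form like the sketch below: an intersection-over-union between a thresholded saliency map and a ground-truth region mask, plus a pointing-game check on the saliency peak. This is an illustrative choice of metrics, not one the paper is known to report; the function names and the 0.5 threshold are assumptions.

```python
def saliency_iou(saliency, mask, threshold=0.5):
    """IoU between a thresholded saliency map and a binary mask.

    `saliency` and `mask` are 2D lists of the same shape; saliency values
    are assumed to be normalized to [0, 1].
    """
    intersection = 0
    union = 0
    for saliency_row, mask_row in zip(saliency, mask):
        for value, label in zip(saliency_row, mask_row):
            hot = value >= threshold
            inside = bool(label)
            intersection += int(hot and inside)
            union += int(hot or inside)
    return intersection / union if union else 0.0


def pointing_hit(saliency, mask):
    """Pointing game: does the saliency maximum fall inside the mask?"""
    _, row, col = max(
        (value, r, c)
        for r, saliency_row in enumerate(saliency)
        for c, value in enumerate(saliency_row)
    )
    return bool(mask[row][col])


# Toy 2x2 example: two hot cells, one of which lies inside the mask.
toy_saliency = [[0.9, 0.1], [0.8, 0.2]]
toy_mask = [[1, 0], [0, 0]]
print(saliency_iou(toy_saliency, toy_mask))  # 0.5
print(pointing_hit(toy_saliency, toy_mask))  # True
```

Averaging either score over a set of question-image pairs, separately for the SFT-trained and RL-trained encoders, would turn the qualitative figures into a number that can be compared directly.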

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the attribution of our findings and the efficiency claims for PIVOT. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental rigor.

Point-by-point responses

Referee: [Experimental setup and results sections] Experimental comparisons (Sections describing RL vs. SFT setups): the manuscript does not report matched total training compute, data composition, or optimization hyperparameters between the RL and SFT conditions. Because the central claim attributes improved localization and downstream vision performance primarily to the objective rather than these factors, explicit controls or ablation tables isolating the objective are required to support the attribution.

Authors: We agree that explicit controls are important for isolating the contribution of the training objective. Our original RL and SFT comparisons followed the standard hyperparameter settings and data pipelines reported in the source MLLM papers to enable direct comparison with published baselines. To address the concern, we will add a dedicated ablation subsection with matched total compute (by adjusting training steps), identical data composition, and harmonized optimization hyperparameters. The revised manuscript will include a table reporting these settings and the resulting performance differences.
Revision: yes

Referee: [PIVOT recipe and integration experiments] PIVOT integration results (the section reporting MLLM performance with PIVOT-trained encoders): the claim that a PIVOT encoder outperforms larger, more heavily-trained counterparts at <1% compute cost needs a precise accounting of the baseline compute (e.g., which vision pretraining runs are used for the percentage) and confirmation that the MLLM backbone and other training stages remain identical.

Authors: We will add a precise compute accounting table in the PIVOT section. The <1% figure is computed relative to the full pretraining cost of large vision models such as CLIP ViT-G/14 and SigLIP-400M (estimated from their original papers at >10,000 GPU-hours on web-scale data). PIVOT training uses a small preference dataset for a limited number of epochs. We confirm that all MLLM integration experiments keep the LLM backbone, projector, and subsequent training stages identical, changing only the vision encoder. The revision will explicitly list the baseline models and compute references.
Revision: yes
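The accounting described in this response reduces to a single ratio, sketched below. The baseline uses the 10,000 GPU-hour lower bound quoted in the response; the PIVOT figure is a hypothetical placeholder, since this page gives no measured value.

```python
def compute_fraction(method_gpu_hours: float, baseline_gpu_hours: float) -> float:
    """Fraction of the baseline pretraining compute used by the method."""
    if baseline_gpu_hours <= 0:
        raise ValueError("baseline_gpu_hours must be positive")
    return method_gpu_hours / baseline_gpu_hours


# Lower bound quoted in the simulated rebuttal (">10,000 GPU-hours").
BASELINE_GPU_HOURS = 10_000
# Placeholder for illustration only; not a number from the paper.
HYPOTHETICAL_PIVOT_GPU_HOURS = 80

fraction = compute_fraction(HYPOTHETICAL_PIVOT_GPU_HOURS, BASELINE_GPU_HOURS)
print(f"{fraction:.1%}")  # 0.8%
```

Because the baseline is a lower bound, the resulting fraction is an upper bound: against 10,000 GPU-hours, any PIVOT run under 100 GPU-hours stays below 1%. The comparison is only meaningful if both figures are measured on comparable hardware, which is the reference the referee asks the authors to state.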

Circularity Check

0 steps flagged

No circularity: claims rest on empirical comparisons and reframing of observations

full rationale

The paper's derivation chain consists of experimental investigations (RL vs. SFT on VQA benchmarks, ImageNet classification and segmentation, gradient visualizations) followed by a reframing of the observed results into the PIVOT recipe. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the chain. PIVOT is explicitly described as a reframing of findings rather than a derivation that reduces to its inputs by construction. The central attribution to the post-training objective rests on direct comparisons, not on load-bearing self-citations or a smuggled-in ansatz.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The work is primarily empirical and introduces no new mathematical axioms or free parameters; the PIVOT method is a practical reframing of experimental observations rather than a derivation resting on unstated assumptions beyond standard training practices.

invented entities (1)

PIVOT · no independent evidence
purpose: Preference-Instructed Vision OpTimization recipe for training vision encoders
Introduced as a distilled practical method derived from the RL versus SFT comparison results.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Theorem: IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "RL produces stronger and precisely localized visual representations compared to SFT... Preference-Instructed Vision OpTimization (PIVOT)"

Theorem: IndisputableMonolith/Foundation/AbsoluteFloorClosure · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "gradient signals from DPO align more strongly with question-relevant regions than those from SFT"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.