Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

Qinwu Xu

arXiv: 2605.16411 · v1 · pith:IXEZ2DTHnew · submitted 2026-05-13 · cs.CV · cs.AI · cs.CL · cs.DB · cs.LG

Pith reviewed 2026-05-20 21:09 UTC · model grok-4.3

keywords: hallucination reduction · vision-language models · direct preference optimization · multimodal preference pairs · spatial reasoning · distribution shift · visual grounding

The pith

Stage-wise preference optimization on targeted multimodal pairs reduces hallucinations in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training vision-language models to avoid producing responses that sound right but fail to match the image. It does this by building preference data in stages, focusing on hard cases such as ambiguous object positions, unclear text, and trick questions that assume false facts. Preference pairs are made by taking a correct answer and creating a close but visually wrong alternative, then applying direct preference optimization so the model learns to favor the grounded version. This matters because current models often favor fluent language over visual accuracy, limiting their reliability for describing photos or reasoning about scenes. If the method holds, it points to a practical way to make autoregressive generation more consistent without changing the base architecture.
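
As a rough illustration of the pair construction described above, the sketch below builds one preference pair by flipping a single spatial term in a grounded answer. The `PreferencePair` class, the `SPATIAL_SWAPS` table, and the `make_pair` helper are hypothetical names introduced here; the paper's actual perturbation procedure is not shown in this summary and may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical swap table (an assumption, not taken from the paper):
# each spatial term maps to its opposite.
SPATIAL_SWAPS = {"left": "right", "right": "left", "above": "below", "below": "above"}
_SPATIAL_PATTERN = re.compile(r"\b(" + "|".join(SPATIAL_SWAPS) + r")\b", re.IGNORECASE)


@dataclass(frozen=True)
class PreferencePair:
    image_id: str
    prompt: str
    chosen: str    # grounded answer
    rejected: str  # minimally perturbed, visually inconsistent answer


def perturb_spatial(answer: str) -> Optional[str]:
    """Flip the first spatial term in the answer; return None if there is none."""
    match = _SPATIAL_PATTERN.search(answer)
    if match is None:
        return None
    replacement = SPATIAL_SWAPS[match.group(0).lower()]
    return answer[:match.start()] + replacement + answer[match.end():]


def make_pair(image_id: str, prompt: str, grounded_answer: str) -> Optional[PreferencePair]:
    """Build a preference pair, or None when no perturbation applies."""
    rejected = perturb_spatial(grounded_answer)
    if rejected is None:
        return None
    return PreferencePair(image_id, prompt, grounded_answer, rejected)
```

For example, `make_pair("img_001", "Where is the cup?", "The cup is to the left of the laptop.")` yields a rejected answer that differs from the chosen one by a single word, while answers with no spatial term return `None`. A real pipeline would also need to check against the image that the flipped answer is actually wrong.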

Core claim

The central claim is that a stage-wise preference optimization framework enables direct preference optimization to separate grounded reasoning from plausible hallucination under distribution shift. It does so by progressively constructing hallucination-focused preference pairs near known failure boundaries, using minimally perturbed yet visually inconsistent alternatives as negatives. The reported effects are improved grounding consistency, reduced hallucination rates, and more informative responses on open-source benchmarks and real-world multimodal scenarios, along with qualitative advantages over several frontier proprietary VLMs in ambiguous spatial reasoning and adversarial false-premise settings.

What carries the argument

A stage-wise multimodal direct preference optimization framework that builds hallucination-focused preference pairs near known failure boundaries from minimally perturbed, visually inconsistent alternatives, targeting ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false premises.

If this is right

Improved consistency when answering ambiguous spatial orientation and object relationship queries.
Better resistance to adversarial false-premise prompts that invite hallucinated details.
Higher rates of visually grounded and informative responses on open-source multimodal benchmarks.
Qualitative outperformance against several frontier proprietary VLMs in cross-model comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same targeted construction of preference pairs near failure modes could be applied to reduce hallucinations in other autoregressive generative systems.
If the approach scales, it suggests that data construction strategy may matter more than raw model scale for grounding.
Persistent issues might still require moving beyond standard autoregressive decoding toward models that enforce physical consistency directly.
The framework implies that handling distribution shift through staged, boundary-focused data can improve robustness in multimodal tasks more broadly.

Load-bearing premise

Minimally perturbed yet visually inconsistent alternatives can be generated to form preference pairs that allow DPO to reliably separate grounded reasoning from plausible hallucination under distribution shift.

What would settle it

No measurable drop in hallucination rates or gain in visual grounding scores on standard VLM benchmarks and real-world evaluation scenarios after running the stage-wise DPO procedure compared with baseline training.
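
One way to make "no measurable drop" concrete is a paired comparison on the same evaluation items. The sketch below is an assumption about how such a check could look, not the paper's evaluation protocol: `hallucination_rate` and `paired_bootstrap_diff` are hypothetical helpers, and each input is a list of per-item flags (True means the response was judged hallucinated).

```python
import random


def hallucination_rate(flags):
    """Fraction of responses judged hallucinated (True = hallucinated)."""
    if not flags:
        raise ValueError("flags must be non-empty")
    return sum(flags) / len(flags)


def paired_bootstrap_diff(baseline, tuned, n_resamples=2000, seed=0):
    """Return (observed, lower, upper) for baseline rate minus tuned rate.

    Items are resampled jointly so each draw keeps the baseline and tuned
    judgments for the same evaluation item together. The bounds are the
    2.5th and 97.5th percentiles of the resampled differences.
    """
    if len(baseline) != len(tuned):
        raise ValueError("both models must be scored on the same items")
    n = len(baseline)
    rng = random.Random(seed)
    observed = hallucination_rate(baseline) - hallucination_rate(tuned)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(int(baseline[i]) - int(tuned[i]) for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the interval returned for a benchmark comfortably includes zero, that would count as evidence for the "no measurable drop" outcome on that benchmark. The resample count and the percentile interval are illustrative choices.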

Figures

Figures reproduced from arXiv: 2605.16411 by Qinwu Xu.

**Figure 2.** Overview of the model and training pipeline. A ViT encoder extracts visual features, …

**Figure 3.** Stage-wise data curation. In the first stage (SFT), training data emphasizes concise and …

**Figure 4.** DPO response length of words distribution: a) original one; b) new one with data duplica…

**Figure 5.** Representative comparisons between SFT and DPO model outputs across diverse multi…

**Figure 6.** Representative examples of hallucination mitigation under spatial reasoning, OCR, and …

**Figure 7.** Adversarial hallucination example for fine-grained object identification. The DPO-trained …

**Figure 8.** Cross-model comparison across challenging multimodal reasoning scenarios involving …

Original abstract

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper stages DPO around targeted hallucination boundaries in VLMs but rests on qualitative claims without numbers or pair-construction details.

The main takeaway is that the authors extend standard DPO into a staged process that builds preference pairs focused on specific VLM failure modes such as spatial ambiguity, object relations, OCR issues, and false-premise prompts. They generate negatives by minimal perturbations meant to keep linguistic plausibility while breaking visual consistency, then run DPO to push the model toward grounded outputs. This is a concrete, if incremental, way to move beyond generic instruction tuning for hallucination control.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a stage-wise preference optimization framework to reduce hallucinations in vision-language models. It constructs targeted multimodal preference pairs near known failure boundaries (ambiguous spatial reasoning, object relationships, OCR uncertainty, and adversarial false-premise cases) by generating hallucinated negatives via minimally perturbed yet visually inconsistent alternatives, then applies Direct Preference Optimization (DPO) to separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and qualitative cross-model evaluations against proprietary VLMs are claimed to show improved grounding consistency and reduced hallucination.

Significance. If the perturbation-based preference pairs can be shown to isolate visual grounding issues without introducing correlated artifacts or label noise, the framework would offer a practical, data-centric route to improving VLM reliability under distribution shift. The emphasis on stage-wise construction and specific hallucination modes extends existing DPO techniques to multimodal settings in a targeted way.

major comments (2)

[Abstract] The core technical step—generating hallucinated negatives through minimally perturbed yet visually inconsistent alternatives—is load-bearing for the DPO stage and the distribution-shift claim. The abstract provides no quantitative validation (e.g., metrics on visual inconsistency, linguistic plausibility preservation, or distribution proximity) or ablation on perturbation fidelity, leaving open whether the resulting pairs reliably teach grounding or merely reinforce superficial cues.
[Abstract] The empirical support is described only at a high level: 'experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency.' No specific metrics, error bars, ablation tables, or explicit data-construction procedure appear in the provided description, which undermines assessment of the magnitude and robustness of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the abstract to increase specificity on the perturbation method and empirical results while preserving its concise nature.

Referee: [Abstract] The core technical step—generating hallucinated negatives through minimally perturbed yet visually inconsistent alternatives—is load-bearing for the DPO stage and the distribution-shift claim. The abstract provides no quantitative validation (e.g., metrics on visual inconsistency, linguistic plausibility preservation, or distribution proximity) or ablation on perturbation fidelity, leaving open whether the resulting pairs reliably teach grounding or merely reinforce superficial cues.

Authors: We agree that the abstract would benefit from greater specificity on perturbation validation. The full manuscript details the construction of hallucinated negatives in Section 3, including the use of minimal visual perturbations to ensure inconsistency while maintaining linguistic plausibility, supported by analyses of visual and textual metrics as well as ablations on perturbation parameters. We have revised the abstract to reference this validation approach and the associated ablations.

revision: yes

Referee: [Abstract] The empirical support is described only at a high level: 'experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency.' No specific metrics, error bars, ablation tables, or explicit data-construction procedure appear in the provided description, which undermines assessment of the magnitude and robustness of the reported gains.

Authors: We acknowledge the abstract presents results at a summary level. The manuscript provides specific metrics with error bars, ablation tables for the stage-wise process, and the full data-construction procedure in Sections 4 and 5. We have updated the abstract to incorporate key quantitative findings and explicit references to these experimental details and tables.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework proposal with external validation

The paper describes a stage-wise preference optimization method that constructs hallucination-focused preference pairs via minimal perturbations and applies DPO, with results evaluated on open-source benchmarks and cross-model qualitative comparisons to proprietary VLMs. No equations, derivations, or self-citations are presented that reduce any claimed improvement or separation of grounded reasoning to a quantity fitted or defined inside the paper itself. The central claims rest on the construction of new training pairs and external evaluation rather than any self-referential reduction or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the constructed preference pairs and the assumption that DPO applied to them yields better grounding than standard training; these are domain assumptions rather than derived results.

free parameters (1)

perturbation magnitude for negative examples
The degree of visual inconsistency introduced to create hallucinated negatives is a design choice that must be set to produce useful pairs.
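
Since this summary does not say how the perturbation magnitude is set, the sketch below shows one possible textual proxy: the share of tokens that differ between the chosen and rejected answers. This is an assumption for illustration only. It measures how small the edit is in text, not how visually inconsistent the result is, and `token_change_ratio`, `is_minimal_perturbation`, and the `max_ratio` threshold are hypothetical.

```python
from difflib import SequenceMatcher


def token_change_ratio(chosen: str, rejected: str) -> float:
    """Share of whitespace-separated tokens that differ (0.0 = identical)."""
    a, b = chosen.split(), rejected.split()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def is_minimal_perturbation(chosen: str, rejected: str, max_ratio: float = 0.25) -> bool:
    """Keep pairs that differ, but only by a small fraction of tokens.

    max_ratio is an illustrative threshold, not a value from the paper.
    """
    ratio = token_change_ratio(chosen, rejected)
    return 0.0 < ratio <= max_ratio
```

A filter like this would reject pairs where the negative is a full rewrite, which is the case where a model could learn to tell the answers apart by style alone.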

axioms (1)

domain assumption: DPO can separate grounded multimodal reasoning from linguistically plausible but visually inconsistent continuations when given appropriate preference pairs
Invoked when the framework is said to enable DPO to better separate grounded reasoning from plausible hallucination.
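
To show what this assumption asks of the training signal, the sketch below computes the standard per-example DPO loss from sequence log-probabilities. It is a minimal scalar version under stated assumptions: the log-probabilities are taken as already computed by a policy and a frozen reference model conditioned on the image and prompt, `dpo_loss` is a hypothetical helper name, and `beta=0.1` is an illustrative default rather than a value reported for this paper.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen gain - rejected gain))).

    Each gain is the policy log-probability minus the reference log-probability
    for that response.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the grounded answer relative to the perturbed one. With near-identical pairs, most tokens are shared between the two answers, so the margin depends largely on the few tokens that differ, which is why pair quality carries so much weight here.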

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 6 internal anchors

[1] Bai, Jinze, Shuai Bai, Shusheng Yang, et al. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966.

[2] Jaegle, Andrew, Sebastian Borgeaud, Jean-Baptiste Alayrac, et al. Perceiver IO: A General Architecture for Structured Inputs and Outputs. arXiv preprint arXiv:2107.14795.

[3] Liu, Yuan, Haotian Li, Yuhang Wu, et al. MMBench: Is Your Multi-modal Model an All-around Player? arXiv preprint arXiv:2307.06281.

[4] Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. Training Language Models to Follow Instructions with Human Feedback. arXiv preprint arXiv:2203.02155.

[5] Shao, Zhihong, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[6] Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.