
arxiv: 2605.31312 · v1 · pith:NJVJJDOVnew · submitted 2026-05-29 · 💻 cs.CV · cs.CL

Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords multimodal hallucination · vision-language models · direct preference optimization · contrastive optimization · in-context learning · visual supervision · hallucination mitigation · preference alignment

The pith

Placing contrastive images in one shared prompt context produces a consistent objective for visual preference optimization that reduces multimodal hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that textual DPO lacks visual signals for fixing hallucinations in VLMs and that prior visual DPO variants produce inconsistent objectives from partition-function mismatches plus coarse negatives that permit shortcuts. IC-VCO corrects the inconsistency by embedding the original and negative images inside the same multi-image prompt so the preference loss operates over a single coherent context. An auxiliary Visual Contrast Distillation term keeps the multi-image training aligned with single-image inference, while a sample-editing procedure creates hard negatives through targeted semantic changes. The authors report the best overall performance across five benchmarks.
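The partition-function argument can be made concrete with the standard DPO reparameterization of the reward. The following is a sketch under that standard form, not the paper's own derivation; the symbols $\pi_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, $r$, and the shared context $c = (m, m', x)$ are notation assumed here for illustration.

```latex
\begin{align*}
% Reward implied by the KL-regularized optimum (standard DPO form):
r(m, x, y) &= \beta \log \frac{\pi_\theta(y \mid m, x)}{\pi_{\mathrm{ref}}(y \mid m, x)} + \beta \log Z(m, x) \\[4pt]
% Textual preference, same image m: the partition term cancels.
r(m, x, y) - r(m, x, y') &= \beta \left[ \log \frac{\pi_\theta(y \mid m, x)}{\pi_{\mathrm{ref}}(y \mid m, x)}
  - \log \frac{\pi_\theta(y' \mid m, x)}{\pi_{\mathrm{ref}}(y' \mid m, x)} \right] \\[4pt]
% Visual preference across separate contexts m and m': a residual term remains.
r(m, x, y) - r(m', x, y) &= \beta \left[ \log \frac{\pi_\theta(y \mid m, x)}{\pi_{\mathrm{ref}}(y \mid m, x)}
  - \log \frac{\pi_\theta(y \mid m', x)}{\pi_{\mathrm{ref}}(y \mid m', x)} \right]
  + \beta \log \frac{Z(m, x)}{Z(m', x)} \\[4pt]
% Shared context c = (m, m', x): both rewards carry the same Z(c), which cancels.
r(c, y) - r(c, y') &= \beta \left[ \log \frac{\pi_\theta(y \mid c)}{\pi_{\mathrm{ref}}(y \mid c)}
  - \log \frac{\pi_\theta(y' \mid c)}{\pi_{\mathrm{ref}}(y' \mid c)} \right]
\end{align*}
```

In the third line the ratio $Z(m, x)/Z(m', x)$ is intractable and is what a separate-context visual DPO loss would have to ignore. The fourth line shows why conditioning both compared responses on one prompt avoids the residual term; how the paper pairs responses with images inside that prompt is not stated in the abstract.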

Core claim

By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. The method further introduces Visual Contrast Distillation as a reliability-gated regularizer that maintains consistency between multi-image contrastive training and single-image inference, together with a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations.
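The "reliability-gated" behavior of Visual Contrast Distillation is described only briefly in the material reproduced here (a correctness gate and a confidence gate, per the Figure 6 caption). The sketch below is one hypothetical reading of such a dual gate, not the authors' implementation: the function names, the use of preference margins as inputs, and the `student_margin_target` threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    active: bool
    reason: str


def vcdist_gate(teacher_margin: float, student_margin: float,
                student_margin_target: float = 1.0) -> GateDecision:
    """Hypothetical dual gate (assumed form, not taken from the paper).

    teacher_margin: preference margin of the multi-image branch (assumed teacher).
    student_margin: preference margin of the single-image branch (assumed student).
    """
    # Correctness gate: skip samples where the teacher itself prefers the wrong response.
    if teacher_margin <= 0:
        return GateDecision(False, "correctness-blocked")
    # Confidence gate: skip samples where the student is already confident enough.
    if student_margin >= student_margin_target:
        return GateDecision(False, "confidence-blocked")
    return GateDecision(True, "active")


def gate_fractions(margin_pairs):
    """Fraction of (teacher_margin, student_margin) pairs falling in each gate state."""
    if not margin_pairs:
        raise ValueError("margin_pairs must not be empty")
    counts = {"active": 0, "correctness-blocked": 0, "confidence-blocked": 0}
    for teacher_margin, student_margin in margin_pairs:
        counts[vcdist_gate(teacher_margin, student_margin).reason] += 1
    return {name: count / len(margin_pairs) for name, count in counts.items()}
```

Tracking the three fractions over training steps would give curves of the kind the Figure 6 caption describes (active fraction and correctness-blocked fraction).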

What carries the argument

In-Context Visual Contrastive Optimization (IC-VCO), the mechanism that embeds original and negative images inside one shared prompt context so the preference loss is computed without partition-function mismatch.

If this is right

  • The preference objective becomes theoretically consistent because all images share the same context window.
  • The auxiliary distillation term transfers the learned contrastive signal from training to ordinary single-image inference.
  • Hard negatives generated by precise semantic edits reduce the opportunity for models to exploit coarse visual cues.
  • Overall hallucination rates fall on every benchmark when both the shared-context loss and the editing strategy are used.
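As a minimal sketch of the first point, the standard DPO loss can be written over sequence log-probabilities that are all conditioned on the same multi-image prompt. This is the generic DPO formula, not the exact IC-VCO objective; the function names, the `beta` default, and the assumption that log-probabilities are precomputed are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1):
    """Standard DPO loss for one preference pair.

    All four log-probabilities are assumed to be computed under the SAME
    prompt (here: one shared multi-image context), so no partition term
    appears in the margin.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -log_sigmoid(margin), margin
```

A zero margin gives a loss of log 2; the loss shrinks as the policy raises the chosen response relative to the reference more than it raises the rejected one.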

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shared-context trick could be applied to preference optimization in other multimodal tasks where separate negative examples previously produced inconsistent gradients.
  • The editing strategy for hard negatives might extend to text-only or audio preference data by applying analogous local semantic perturbations.
  • If the method scales, it suggests that context length rather than model size may be the more direct lever for stabilizing visual alignment objectives.

Load-bearing premise

That the dominant failure in earlier visual DPO comes from partition-function mismatch, and that embedding the images in one context removes this mismatch without creating new inconsistencies when the model later sees single images.

What would settle it

A direct computation of the learned preference probabilities on held-out contrastive pairs that shows the shared-context loss still produces the same mismatched ratios as separate-image baselines, or no measurable drop in hallucination rates on the five evaluation sets.
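A check of this kind could be sketched as follows, assuming per-pair preference margins (chosen reward minus rejected reward) have already been computed on held-out pairs. The function names are hypothetical, and the Bradley-Terry sigmoid is the standard DPO preference model rather than something confirmed from the paper's text.

```python
import math


def preference_probability(margin: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def reward_accuracy(margins) -> float:
    """Fraction of held-out pairs whose margin is strictly positive."""
    margins = list(margins)
    if not margins:
        raise ValueError("margins must not be empty")
    return sum(1 for m in margins if m > 0) / len(margins)
```

Comparing these two quantities between a shared-context model and a separate-image baseline on the same held-out pairs is one way to run the test described above.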

Figures

Figures reproduced from arXiv: 2605.31312 by Chen Chen, Haolin Deng, Haonan Lu, Xin Zou, Xuming Hu, Zhiwei Jin.

Figure 1: Schematic comparison of preference optimization frameworks. (a) Standard DPO optimizes textual preferences (y vs. y′) while treating the image m merely as a static condition, lacking explicit supervision for visual grounding. (b) Visual Preference DPO attempts to introduce visual rejected samples by changing visual context, e.g. swapping the input images (m vs. m′). However, this approach suffers from a…
Figure 2: Qualitative comparison of contrastive images. Synthetic baselines (yellow) exhibit global stylistic shifts, acting as coarse-grained negatives prone to shortcut learning. In contrast, our Contrastive Editing (green) performs surgical, localized interventions while better preserving the visual context, yielding fine-grained hard negatives that compel rigorous visual discrimination.
Figure 3: CLIP-based image pair similarity distribution.
… fine-grained visual discrepancies of the target concept. To approximate this theoretical objective, we propose a Contrastive Sample Editing Framework utilizing a targeted edit pipeline to generate high-quality contrastive samples. Editing Pipeline. Given a seed tuple (m, x, y, y′), we use QwenVL-Plus (Bai et al., 2025a) as an expert planner to generate an ex…
Figure 4: Training diagnostics of IC-VCO on synthetic and edited preference data. (a) Edited preferences are harder to optimize than synthetic preferences under the same response-level IC-VCO setup, yielding lower reward accuracy for both the single-image and multi-image branches. (b) On edited data, the multi-image branch consistently achieves higher reward accuracy than the single-image branch, indicating that mul…
Figure 6: VCDist dual-gating dynamics. (a) As training progresses, the active fraction increases while the correctness-blocked fraction decreases. (b) The confidence gate provides an additional filtering step among teacher-correct samples by blocking cases where distillation is no longer needed.
Impact of Partition Function Mismatch. Directly estimating log Z(m, x)/Z(m′, x) is difficult in DPO because the reward …
Figure 8: Reports ∆ = Pos2 − Pos1, where positive values indicate higher values when the anchor-targeted image is placed second. The teacher-side quantities show a detectable position-2 advantage: the last-100-step deltas are +2.0 points for teacher probability, +8.4 points for teacher accuracy, and +12.7 points for VCDist trigger rate. However, …
[Plot: teacher probability, teacher accuracy, and trigger rate, pos1 vs. pos2]
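A similarity distribution like the one named in the Figure 3 caption can be approximated by binning cosine similarities between paired image embeddings. The sketch below assumes the embeddings were already produced by some image encoder (CLIP in the paper's case) and are passed in as plain lists of floats; the function names and binning scheme are illustrative, not the authors' code.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_histogram(embedding_pairs, bins: int = 10):
    """Count pair similarities in equal-width bins spanning [-1, 1]."""
    counts = [0] * bins
    for original, negative in embedding_pairs:
        s = cosine_similarity(original, negative)
        # Map [-1, 1] onto bin indices; clamp s == 1 into the last bin.
        index = min(int((s + 1.0) / 2.0 * bins), bins - 1)
        counts[index] += 1
    return counts
```

Under this reading, fine-grained edited negatives would be expected to concentrate in the highest-similarity bins, while coarse synthetic negatives would spread lower.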
Original abstract

Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability-gated regularizer that encourages consistency between multi-image contrastive training and single-image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at https://github.com/OPPO-Mente-Lab/IC-VCO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard textual DPO fails to mitigate multimodal hallucinations in VLMs due to missing visual supervision, while prior visual preference DPO variants suffer from partition-function mismatches yielding inconsistent objectives and from coarse negatives enabling shortcuts. It proposes IC-VCO, which places contrastive images in a shared multi-image context to produce a mathematically rigorous objective, introduces VCDist as a reliability-gated regularizer to enforce consistency between multi-image training and single-image inference, and adds a contrastive sample editing method for hard negatives via semantic perturbations. Experiments on five benchmarks are reported to show best overall performance.

Significance. If the claimed mathematical rigor of the IC-VCO objective can be formally established and the empirical gains prove robust under ablations and error bars, the work would advance visual preference tuning by directly addressing a known source of inconsistency in visual DPO. The public release of code and data strengthens the contribution by enabling direct verification.

major comments (2)
  1. [Abstract] the assertion that placing contrastive images in a shared multi-image context 'ensures a mathematically rigorous objective' is unsupported by any derivation, proof, or explicit partition-function analysis; the introduction of VCDist to handle the train-test context mismatch indicates an approximation whose bias is not bounded.
  2. [Abstract] the claim of 'best overall performance' on five benchmarks is presented without error bars, ablation tables, or statistical tests, preventing assessment of whether gains are reliable or attributable to the proposed components rather than implementation details.
minor comments (1)
  1. [Abstract] the GitHub link for code and data is a positive step for reproducibility and should be retained.
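The error bars requested in major comment 2 could be obtained with a percentile bootstrap over per-seed (or per-example) benchmark scores. The sketch below is a generic procedure, not something the paper reports; the function name, the resample count, and the fixed seed are assumptions.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)
    return (statistics.fmean(scores),
            resampled_means[lower_index],
            resampled_means[upper_index])
```

Calling `bootstrap_mean_ci(per_seed_scores)` for each method and benchmark gives an interval to report next to each mean; overlapping intervals between a method and its ablation would suggest the gain is not clearly attributable to the ablated component.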

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address the two major comments point by point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] the assertion that placing contrastive images in a shared multi-image context 'ensures a mathematically rigorous objective' is unsupported by any derivation, proof, or explicit partition-function analysis; the introduction of VCDist to handle the train-test context mismatch indicates an approximation whose bias is not bounded.

    Authors: We acknowledge that the manuscript does not provide an explicit derivation or partition-function analysis to support the claim of mathematical rigor. In the revised version we will add a formal analysis (in the main text or an appendix) showing how the shared multi-image context aligns the partition functions and yields a consistent objective, in contrast to prior visual DPO approaches. We will also expand the discussion of VCDist to clarify its role as a regularizer and include any available bounds or empirical analysis of the resulting approximation. revision: yes

  2. Referee: [Abstract] the claim of 'best overall performance' on five benchmarks is presented without error bars, ablation tables, or statistical tests, preventing assessment of whether gains are reliable or attributable to the proposed components rather than implementation details.

    Authors: We agree that the abstract claim would be more robust with supporting details. The full paper reports results across five benchmarks, but we will revise the abstract to qualify the performance statement and ensure the experimental section (and any referenced tables) includes error bars, ablation studies, and statistical significance tests. This will allow readers to better evaluate the reliability and attribution of the gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

Full rationale

The paper builds IC-VCO on standard DPO by introducing a shared multi-image context to address partition-function mismatch, then adds VCDist explicitly as an auxiliary regularizer to bridge the acknowledged train-inference gap and a separate contrastive editing strategy. No quoted step reduces a claimed prediction or rigorous objective to a fitted parameter defined inside the same work, nor does any load-bearing claim rest on a self-citation chain or imported uniqueness theorem. The central claim remains an independent architectural choice whose consistency is handled by an additional term rather than asserted by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or experimental details, so free parameters, axioms, and invented entities cannot be enumerated.

pith-pipeline@v0.9.1-grok · 5738 in / 995 out tokens · 15202 ms · 2026-06-28T22:41:22.082020+00:00 · methodology

