pith. machine review for the scientific record.

arxiv: 2605.04874 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CL · cs.CV

Recognition: unknown

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords Direct Preference Optimization · Multimodal Large Language Models · Hallucination Mitigation · Epistemic Uncertainty · Vision-Language Alignment · Self-Correction · Exploratory Training

The pith

Uncertainty-aware exploratory direct preference optimization lets multimodal models identify and correct visual deficiencies using token-level epistemic uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes UE-DPO as a way to refine direct preference optimization for multimodal large language models that suffer from hallucinations. Standard approaches use the model's own sensitivity estimates to weight training, but this creates bias toward already-learned visual cues while missing harder details. UE-DPO instead measures epistemic uncertainty from the model's failure to ground tokens in the image and uses that signal to apply stronger learning pressure on deficient tokens in preferred responses while relaxing penalties on useful knowledge in dispreferred ones. The method includes a theoretical justification and shows improved hallucination mitigation in experiments. A reader would care because it turns an internal model signal into a self-correction mechanism without requiring new external data.

Core claim

UE-DPO quantifies token-level epistemic uncertainty from the model's inability to ground predictions in the given image, then applies an uncertainty-aware exploration intensity that increases learning emphasis on visually deficient tokens in preferred samples and reduces over-penalization of beneficial knowledge in dispreferred samples, thereby enabling the model to uncover cognitive deficiencies and pursue self-correction with theoretical support.

What carries the argument

Token-level epistemic uncertainty: it quantifies the model's failure to ground token predictions in the image and drives an uncertainty-aware exploration intensity that applies differential learning pressure across the two sides of each preference pair. A sketch of the resulting loss shape follows.
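To make the mechanism concrete, here is a minimal PyTorch sketch of the general shape such a loss could take. The specific weighting (1 + u on preferred tokens, 1 − u on dispreferred ones), the function names, and the hyperparameter default are illustrative assumptions, not the paper's formulas.

```python
import torch
import torch.nn.functional as F

def ue_dpo_loss(logp_pref, logp_dispref,          # per-token log-probs under the policy
                ref_logp_pref, ref_logp_dispref,  # same tokens under the frozen reference model
                u_pref, u_dispref,                # per-token epistemic uncertainty in [0, 1]
                beta=0.1):
    """Hedged sketch of an uncertainty-weighted DPO objective (not the paper's exact loss).

    High-uncertainty tokens in the preferred response receive extra learning
    pressure; high-uncertainty tokens in the dispreferred response receive a
    relaxed penalty, so beneficial knowledge there is not over-penalized.
    """
    # Illustrative exploration intensity: up-weight deficient tokens on the
    # preferred side, soften the penalty on uncertain dispreferred tokens.
    w_pref = 1.0 + u_pref
    w_dispref = 1.0 - u_dispref

    # Uncertainty-weighted sequence-level log-ratios against the reference model.
    pref_margin = (w_pref * (logp_pref - ref_logp_pref)).sum(-1)
    dispref_margin = (w_dispref * (logp_dispref - ref_logp_dispref)).sum(-1)

    # Standard DPO logistic loss applied to the weighted margins.
    return -F.logsigmoid(beta * (pref_margin - dispref_margin)).mean()
```

The point of the sketch is the asymmetry: uncertainty raises pressure on the preferred side and relaxes it on the dispreferred side, which is exactly the behavior the core claim describes.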

If this is right

  • Sequence-level preferences transfer more effectively into fine-grained supervision on visual fidelity.
  • Training emphasis shifts away from reinforcing already-mastered cues toward hard-to-perceive details.
  • Over-penalization of useful knowledge in dispreferred responses decreases.
  • Models gain an explicit mechanism for active exploration and self-correction during alignment.
  • The approach maintains robustness across different multimodal tasks and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same uncertainty-driven weighting could be tested in non-preference settings such as standard supervised fine-tuning for vision-language tasks.
  • If internal uncertainty reliably proxies perceptual difficulty, it might reduce reliance on human preference data in broader multimodal alignment pipelines.
  • Combining UE-DPO with external vision models for uncertainty estimation could further isolate whether the signal must come from the target model itself.
  • The method suggests a general pattern where model-internal uncertainty replaces heuristic sensitivity scores in any preference-based training loop.

Load-bearing premise

Epistemic uncertainty computed from the still-training model's own token-grounding failures provides an unbiased signal that reliably highlights critical but overlooked visual details without creating new self-referential bias.

What would settle it

Compare hallucination rates on held-out images containing rare or ambiguous visual elements for models trained with UE-DPO versus standard DPO; if the rates are indistinguishable, or if the uncertainty estimates correlate more strongly with well-learned tokens than with deficient ones, the central claim fails.
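The correlation half of that test is straightforward to operationalize. A minimal sketch, assuming per-token uncertainty scores and per-token hallucination labels have already been collected on a held-out set (both inputs are hypothetical here):

```python
import numpy as np
from scipy.stats import spearmanr

def uncertainty_sanity_check(uncertainty, token_is_hallucinated):
    """Rank-correlate uncertainty with hallucination labels on held-out tokens.

    The central claim predicts a clearly positive correlation: deficient
    (hallucinated) tokens should carry the higher uncertainty. A near-zero
    or negative rho would indicate the signal tracks well-learned tokens.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    labels = np.asarray(token_is_hallucinated, dtype=float)
    rho, p_value = spearmanr(uncertainty, labels)
    return rho, p_value
```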

Figures

Figures reproduced from arXiv: 2605.04874 by Huatian Zhang, Lei Zhang, Yongdong Zhang, Zhendong Mao.

Figure 1: Illustration of the focus shift from established visual … view at source ↗
Figure 2: Schematic illustration of our method. (a) For preferred … view at source ↗
Figure 3: Visualization of the token-wise visual sensitivity and epistemic uncertainty in generating the corresponding responses. Positive … view at source ↗
Original abstract

Direct Preference Optimization (DPO) has proven to be an effective solution for mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. One of its key challenges lies in how to transfer the sequence-level preference into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate training emphasis according to the model's self-assessed visual sensitivity signals. However, such sensitivity, estimated by a model still under training, introduces self-referential bias: reinforcing already well-learned visual cues while neglecting hard-to-perceive but critical details, thereby limiting deeper alignment. In this work, we propose an Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) method for MLLMs, which enables the model to uncover its cognitive deficiencies and actively explore for self-correction, guided by token-level epistemic uncertainty. Specifically, we first quantify the uncertainty from the model's failure to ground token predictions in the given image. Then, based on an uncertainty-aware exploration intensity, we encourage more learning pressure on visually deficient tokens in preferred samples, and alleviate the over-penalization of beneficial knowledge in dispreferred samples. Further, we provide a theoretical justification for our method, and extensive experiments demonstrate its effectiveness and robustness.
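The abstract does not specify the estimator, but one plausible reading of "failure to ground token predictions in the given image" is a contrast between image-conditioned and image-ablated token probabilities. A hedged sketch of that reading, assuming an HF-style multimodal interface with input_ids and pixel_values (an assumption for illustration, not the paper's implementation):

```python
import torch

@torch.no_grad()
def grounding_uncertainty(model, input_ids, pixel_values):
    """Score each response token by how little the image supports it.

    Assumption (not from the paper): a token whose probability barely changes
    when the image is blanked out is weakly grounded in the image, so the gap
    between image-conditioned and image-free log-probs proxies epistemic
    uncertainty about visual grounding.
    """
    # Next-token logits with the real image and with the image ablated.
    logits_img = model(input_ids=input_ids, pixel_values=pixel_values).logits
    logits_blank = model(input_ids=input_ids,
                         pixel_values=torch.zeros_like(pixel_values)).logits

    targets = input_ids[:, 1:].unsqueeze(-1)
    lp_img = logits_img[:, :-1].log_softmax(-1).gather(-1, targets).squeeze(-1)
    lp_blank = logits_blank[:, :-1].log_softmax(-1).gather(-1, targets).squeeze(-1)

    # Small grounding gap -> high uncertainty; squash to (0, 1).
    return torch.sigmoid(-(lp_img - lp_blank))
```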

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) for multimodal LLMs to reduce hallucinations. It identifies self-referential bias in prior DPO variants that rely on the model's own sensitivity signals during training, and introduces token-level epistemic uncertainty to guide an exploration intensity that increases learning pressure on visually deficient tokens in preferred samples while reducing over-penalization of useful knowledge in dispreferred samples. The authors claim a theoretical justification for the approach and report extensive experiments showing improved effectiveness and robustness.

Significance. If the dynamic uncertainty estimation can be shown to remain unbiased with respect to the final aligned distribution and reliably surfaces under-learned visual details without circular reinforcement, the method would offer a principled way to achieve finer-grained visual alignment in preference optimization. The explicit focus on theoretical justification and the attempt to break the self-referential loop are strengths that could influence subsequent work on hallucination mitigation in MLLMs.

major comments (2)
  1. [Abstract / Theoretical justification] Abstract and theoretical justification section: the central claim that token-level epistemic uncertainty estimated from the still-updating model provides an unbiased signal for 'uncover[ing] its cognitive deficiencies' requires a derivation showing that the uncertainty at training step t is not conditioned on the partially aligned parameters in a way that systematically under-ranks genuinely hard visual tokens. Without an explicit bound or invariance argument (e.g., relating the uncertainty measure to the final posterior), the method risks inheriting the same self-referential bias it aims to correct.
  2. [Method] Method section (uncertainty quantification and exploration intensity): the description states that uncertainty is quantified 'from the model's failure to ground token predictions' and then used to modulate an 'uncertainty-aware exploration intensity.' It is unclear whether this intensity remains a fixed hyperparameter or is derived parameter-free from the uncertainty estimate; if the former, the approach reduces to a tuned weighting scheme whose advantage over standard DPO must be demonstrated via ablation rather than assumed.
minor comments (2)
  1. [Experiments] The abstract refers to 'extensive experiments' but provides no quantitative results, baselines, or metrics; the full manuscript should include these in a dedicated results section with clear tables comparing hallucination rates, visual grounding accuracy, and ablation on the uncertainty component.
  2. [Method] Notation for epistemic uncertainty (e.g., how token-level variance or entropy is computed across forward passes or ensembles) should be defined explicitly with an equation, as the current high-level description leaves implementation details ambiguous.
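For concreteness, one candidate form the requested equation could take, consistent with the grounding-failure description but purely an illustration rather than the paper's definition:

```latex
% Candidate token-level epistemic uncertainty: the squashed gap between
% image-ablated and image-conditioned log-probabilities of token y_t,
% with prompt x, image v, policy \pi_\theta (all notation assumed here).
u_t \;=\; \sigma\big( \log \pi_\theta(y_t \mid y_{<t}, x)
        \,-\, \log \pi_\theta(y_t \mid y_{<t}, x, v) \big)
```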

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify key aspects of our work. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Theoretical justification] Abstract and theoretical justification section: the central claim that token-level epistemic uncertainty estimated from the still-updating model provides an unbiased signal for 'uncover[ing] its cognitive deficiencies' requires a derivation showing that the uncertainty at training step t is not conditioned on the partially aligned parameters in a way that systematically under-ranks genuinely hard visual tokens. Without an explicit bound or invariance argument (e.g., relating the uncertainty measure to the final posterior), the method risks inheriting the same self-referential bias it aims to correct.

    Authors: We appreciate the referee's emphasis on rigorously establishing the unbiased nature of the uncertainty signal. The theoretical justification in the manuscript derives the token-level epistemic uncertainty from the model's grounding failures on visual tokens, which is formulated to be independent of the preference-based alignment updates and thus avoids the self-referential sensitivity signals critiqued in prior DPO variants. This grounding-based measure is intended to surface under-learned visual details without circular reinforcement from the evolving parameters. That said, we agree that an explicit invariance argument or bound relating the uncertainty at step t to the final posterior would further solidify the claim. We will revise the theoretical section to include such a derivation, for example by showing that the uncertainty estimate depends primarily on the visual grounding loss variance rather than the partially aligned preference parameters. revision: yes

  2. Referee: [Method] Method section (uncertainty quantification and exploration intensity): the description states that uncertainty is quantified 'from the model's failure to ground token predictions' and then used to modulate an 'uncertainty-aware exploration intensity.' It is unclear whether this intensity remains a fixed hyperparameter or is derived parameter-free from the uncertainty estimate; if the former, the approach reduces to a tuned weighting scheme whose advantage over standard DPO must be demonstrated via ablation rather than assumed.

    Authors: We thank the referee for noting this potential ambiguity in the method description. The exploration intensity is derived in a parameter-free manner directly from the token-level epistemic uncertainty estimate, as it is computed on-the-fly from the grounding failure signals to dynamically adjust learning pressure (increasing it for deficient tokens in preferred samples and reducing over-penalization in dispreferred ones). It is not a fixed hyperparameter. To eliminate any confusion, we will revise the method section to explicitly state this derivation process with the relevant equations. We will also add an ablation study comparing UE-DPO against standard DPO and fixed-intensity variants to empirically confirm the benefits of the uncertainty-aware modulation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The provided abstract explicitly flags self-referential bias in prior DPO variants that rely on a still-training model's sensitivity estimates, then introduces UE-DPO as an alternative that quantifies token-level epistemic uncertainty from grounding failures and modulates exploration intensity accordingly. No equations, derivation steps, or self-citations are supplied that reduce the central claim (uncertainty-guided self-correction) to a fitted hyperparameter, renamed input, or load-bearing self-citation chain. The theoretical justification is asserted as independent support, and the method is presented as addressing rather than inheriting the acknowledged bias, satisfying the criteria for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that uncertainty can be extracted from grounding failures and used to modulate preference signals without circularity; one tunable exploration intensity parameter is introduced.

free parameters (1)
  • uncertainty-aware exploration intensity
    Scalar that controls how much extra learning pressure is applied to high-uncertainty tokens in preferred samples.
axioms (1)
  • domain assumption: Token-level epistemic uncertainty can be reliably quantified from the model's failure to ground its predictions in the provided image.
    This quantity is used to set the exploration intensity and is assumed to point to genuinely deficient visual knowledge.

pith-pipeline@v0.9.0 · 5534 in / 1178 out tokens · 37682 ms · 2026-05-08T16:56:39.790804+00:00 · methodology
