Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Pith reviewed 2026-05-17 10:48 UTC · model grok-4.3
pith:GDV46QHW
The pith
HA-DPO trains multimodal models to prefer accurate image descriptions over hallucinatory ones by optimizing on paired responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reframing hallucination mitigation as a direct preference optimization task and supplying an efficient pipeline for style-consistent positive and negative response pairs, HA-DPO trains models to favor accurate descriptions, which reduces hallucination rates and raises performance on downstream metrics such as POPE accuracy and MME scores across multiple multimodal architectures.
What carries the argument
Hallucination-Aware Direct Preference Optimization (HA-DPO), a preference-learning method that uses paired accurate and hallucinatory responses for the same image to steer the model toward non-hallucinating outputs.
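As a rough illustration of what one training example might look like, the sketch below models a single preference pair as a small record. The class name `PreferencePair` and its field names are hypothetical and are not taken from the paper or its released code; the only property carried over from the text is that both responses describe the same image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example (hypothetical layout, not the authors' schema)."""

    image_path: str  # the single image both responses describe
    prompt: str      # instruction given to the model
    positive: str    # non-hallucinatory response (preferred)
    negative: str    # hallucinatory response (rejected)

    def __post_init__(self):
        # A pair with identical responses carries no preference signal.
        if self.positive == self.negative:
            raise ValueError("positive and negative responses must differ")


pair = PreferencePair(
    image_path="ski.jpg",
    prompt="Describe the image.",
    positive="Four skiers stand on a snowy hill.",
    negative="Four skiers stand on a snowy hill with trees behind them.",
)
```

Keeping the image and prompt on the record makes it harder to pair a response with the wrong image by accident when the data is shuffled.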
If this is right
- Models fine-tuned with HA-DPO exhibit lower hallucination rates on standard benchmarks such as POPE.
- The same training yields higher overall scores on comprehensive evaluation suites like MME.
- The approach transfers across three different mainstream multimodal models with consistent benefits.
- The constructed preference dataset supports robust learning without requiring changes to the base model architecture.
Where Pith is reading between the lines
- The preference-pair construction pipeline could be reused to align models on other grounding tasks such as visual question answering with factual constraints.
- If the gains hold on out-of-distribution images, the method might reduce the need for post-hoc hallucination detection modules in deployed systems.
- Similar preference optimization could be tested on text-only models by generating synthetic contradictory response pairs.
Load-bearing premise
The pipeline produces positive and negative sample pairs that remain high-quality, style-consistent, and free of new biases capable of undermining preference learning.
What would settle it
Run the trained model on a new set of images where the preference pairs were deliberately constructed with mismatched styles or added factual errors and check whether the reported accuracy gains on POPE and MME disappear.
Original abstract
Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem", in which the models generate textual descriptions that inaccurately depict or entirely fabricate content from associated images. This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task. The model is trained to favor the non-hallucinating response when presented with two responses of the same image (one accurate and one hallucinatory). Furthermore, this paper proposes an efficient pipeline for constructing positive (non-hallucinatory) and negative (hallucinatory) sample pairs, ensuring a high-quality, style-consistent dataset for robust preference learning. When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and amplified the models' generalization capabilities. Notably, the MiniGPT-4 model, when enhanced with HA-DPO, demonstrated a substantial improvement: POPE accuracy rose from 51.13% to 86.13% (an absolute improvement of 35%), and the MME score surged from 932.00 to 1326.46 (a relative improvement of 42.32%). The codes, models, and datasets are made accessible at https://opendatalab.github.io/HA-DPO.
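The two headline deltas in the abstract use different conventions: the POPE gain is an absolute difference (in percentage points) and the MME gain is relative to the starting score. The short check below recomputes both from the four figures quoted in the abstract; the variable names are illustrative only.

```python
# Figures quoted in the abstract for MiniGPT-4 before and after HA-DPO.
pope_before, pope_after = 51.13, 86.13    # POPE accuracy, percent
mme_before, mme_after = 932.00, 1326.46   # MME score

# Absolute gain: a difference of two percentages, i.e. percentage points.
pope_gain_pp = pope_after - pope_before

# Relative gain: the change divided by the starting value, as a percent.
mme_gain_rel = (mme_after - mme_before) / mme_before * 100

print(f"POPE: +{pope_gain_pp:.2f} percentage points")
print(f"MME:  +{mme_gain_rel:.2f}% relative")
```

Both results agree with the abstract: 35.00 percentage points for POPE and 42.32% for MME.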
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hallucination-Aware Direct Preference Optimization (HA-DPO) to address hallucinations in Large Vision-Language Models (LVLMs). It reframes the problem as a preference selection task, training the model to prefer accurate (non-hallucinatory) responses over hallucinatory ones for the same image. An efficient pipeline is introduced to construct high-quality, style-consistent positive and negative sample pairs for robust preference learning. Experiments apply HA-DPO to three mainstream multimodal models and report large gains on POPE and MME benchmarks, including MiniGPT-4 improving from 51.13% to 86.13% on POPE and from 932.00 to 1326.46 on MME. Code, models, and datasets are released.
Significance. If the gains arise from genuine learning of visual grounding rather than artifacts in pair construction, the approach could provide a practical and scalable way to reduce hallucinations via preference optimization. The open release of code, models, and datasets is a clear strength that supports reproducibility and further work. The empirical deltas on established benchmarks are notable, but their interpretation depends on confirming that the method does not exploit surface cues in the constructed pairs.
major comments (2)
- [Method section (pair-construction pipeline)] The headline performance claims (e.g., MiniGPT-4 POPE +35 pp, MME +42%) rest on the pipeline producing style-consistent, bias-free positive/negative pairs. The description of this 'efficient pipeline' provides no ablation or control experiment that isolates style-matching, response length, hedging language, or factual density differences between positives and negatives. Without such controls, it remains possible that DPO optimizes for these surface statistics rather than visual grounding, which would undermine the generalization interpretation of the POPE and MME results.
- [Experiments section] No experiment is reported that compares HA-DPO against a baseline using randomly generated or deliberately mismatched negative responses. Such an ablation would directly test whether the reported improvements require the claimed high-quality, style-consistent pairs or could be obtained more cheaply.
minor comments (2)
- Clarify the exact three models used beyond the MiniGPT-4 example and report per-model results in a single table for easier comparison.
- The abstract and results mention DPO hyperparameters only implicitly; add a short paragraph or table listing the chosen beta, learning rate, and any selection procedure.
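The mismatched-negative baseline requested in the second major comment could be built by reassigning negatives across images. The sketch below is one possible construction, not something the paper reports: the function name, the tuple layout `(image_id, positive, negative)`, and the use of a random cycle are all assumptions made for illustration.

```python
import random


def mismatched_negatives(pairs, seed=0):
    """Reassign negatives so that no image keeps its own negative response.

    `pairs` is a list of (image_id, positive, negative) tuples
    (a hypothetical layout chosen for this sketch).
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to mismatch negatives")
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)
    # Each item takes the negative of its successor in one random cycle,
    # so with two or more items nobody receives their own negative.
    donor = {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}
    return [
        (image_id, positive, pairs[donor[i]][2])
        for i, (image_id, positive, _) in enumerate(pairs)
    ]
```

Using a single cycle rather than a plain shuffle guarantees that every pair is actually mismatched, which a uniform shuffle does not.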
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and agree that additional controls would help strengthen the interpretation of our results. We plan to incorporate the suggested experiments in the revised version.
Point-by-point responses
-
Referee: [Method section (pair-construction pipeline)] The headline performance claims (e.g., MiniGPT-4 POPE +35 pp, MME +42%) rest on the pipeline producing style-consistent, bias-free positive/negative pairs. The description of this 'efficient pipeline' provides no ablation or control experiment that isolates style-matching, response length, hedging language, or factual density differences between positives and negatives. Without such controls, it remains possible that DPO optimizes for these surface statistics rather than visual grounding, which would undermine the generalization interpretation of the POPE and MME results.
Authors: We acknowledge the validity of this concern. While the pipeline generates positive and negative pairs from the same LVLM using accuracy-targeted prompts and consistent instruction templates to promote stylistic similarity, the current manuscript does not include explicit ablations for length, hedging, or factual density. In the revised manuscript we will add controls that (i) truncate responses to matched lengths, (ii) construct style-mismatched pairs, and (iii) quantify hedging language differences, to test whether gains persist beyond surface cues. revision: yes
-
Referee: [Experiments section] No experiment is reported that compares HA-DPO against a baseline using randomly generated or deliberately mismatched negative responses. Such an ablation would directly test whether the reported improvements require the claimed high-quality, style-consistent pairs or could be obtained more cheaply.
Authors: We agree that a direct comparison to random or mismatched negatives is a useful control. In the revised manuscript we will report results from a baseline that uses randomly sampled or deliberately mismatched negative responses (e.g., generic incorrect captions or shuffled pairs) and show that performance is substantially lower than with our style-consistent pairs, thereby supporting the value of the proposed pipeline. revision: yes
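The surface-cue controls discussed in the first exchange (length, hedging language) can be approximated with a simple audit of the pair dataset before any training. The sketch below is an assumption-laden illustration: the hedging word list, the whitespace tokenization, and the function names are invented here, and the example pair is made up rather than drawn from the released data.

```python
from statistics import mean

# Hypothetical hedging lexicon; not taken from the paper.
HEDGES = {"may", "might", "possibly", "appears", "seems", "likely", "perhaps"}


def surface_stats(text):
    """Token count and hedging rate for one response (whitespace tokens)."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    hedges = sum(t in HEDGES for t in tokens)
    return {"length": len(tokens), "hedge_rate": hedges / max(len(tokens), 1)}


def audit_pairs(pairs):
    """Mean positive-minus-negative gap for each surface statistic."""
    gaps = {"length": [], "hedge_rate": []}
    for positive, negative in pairs:
        pos, neg = surface_stats(positive), surface_stats(negative)
        for key in gaps:
            gaps[key].append(pos[key] - neg[key])
    return {key: mean(values) for key, values in gaps.items()}


# Made-up example pair, for illustration only.
pairs = [
    (
        "Four skiers stand on a snowy hill.",
        "Four skiers stand on a snowy hill with trees behind them.",
    ),
]
print(audit_pairs(pairs))
```

A mean gap far from zero on either statistic would suggest that a preference model could separate positives from negatives without looking at the image, which is the failure mode the referee describes.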
Circularity Check
No circularity: empirical gains on external benchmarks
full rationale
The paper introduces HA-DPO as a reframing of hallucination mitigation into a preference optimization task using constructed positive/negative response pairs, then reports measured improvements on held-out benchmarks (POPE, MME) after fine-tuning three models. No equations, derivations, or first-principles claims appear in the provided text; the central results are performance deltas obtained via standard training and evaluation rather than quantities forced by construction from fitted inputs or self-citations. The method relies on an external DPO framework and benchmark metrics independent of the training pairs, satisfying the self-contained criterion against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- DPO beta and learning-rate hyperparameters
axioms (1)
- domain assumption: Direct preference optimization loss can be applied to vision-language model outputs to reduce hallucinations when positive and negative pairs are available.
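For readers unfamiliar with the objective behind the ledger entries above, the sketch below computes the standard DPO loss (Rafailov et al., 2023) for a single preference pair from summed response log-probabilities. It is a scalar illustration, not the authors' training code; the function names are invented here and `beta=0.1` is an assumed placeholder, since this page does not state the value the paper used.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so that exp() never receives a large positive value.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one (positive, negative) response pair.

    Each argument is the summed token log-probability of a full response
    given the same image and prompt, under the policy or the frozen
    reference model. `beta` is an assumed placeholder value.
    """
    pos_ratio = policy_logp_pos - ref_logp_pos  # how much the policy upweights the positive
    neg_ratio = policy_logp_neg - ref_logp_neg  # how much the policy upweights the negative
    margin = beta * (pos_ratio - neg_ratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the positive response's log-probability relative to the negative one.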
Forward citations
Cited by 20 Pith papers
-
Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs
AVES-DPO mitigates hallucinations in LVLMs by creating in-distribution preference pairs through the model's self-correction, outperforming baselines with only 5.2k samples.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.
-
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Deep Pre-Alignment for VLMs
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
-
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
-
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
-
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with neglig...
-
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
-
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.
-
Visual-RFT: Visual Reinforcement Fine-Tuning
Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
-
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.
-
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
-
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
-
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
-
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
Reference graph
Works this paper leans on
-
[1]
Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olss...
work page 2022
-
[2]
The exploration-exploitation dilemma: a multidisciplinary framework
Oded Berger-Tal, Jonathan Nathan, Ehud Meron, and David Saltz. The exploration-exploitation dilemma: a multidisciplinary framework. PLOS ONE, 2014. 2
work page 2014
-
[3]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz...
work page 2020
-
[4]
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NIPS), 2017. 2
work page 2017
-
[5]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv.org, 2023. 1, 2, 7
work page 2023
-
[6]
MME: A comprehensive evaluation benchmark for multimodal large language models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv.org, 2023. 8
work page 2023
-
[7]
Detecting and preventing hallucinations in large vision language models
Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv.org, 2023. 1
work page 2023
-
[8]
Lora: Low-rank adaptation of large language models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR). OpenReview.net, 2022. 6
work page 2022
-
[9]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 2017. 4, 6, 1
work page 2017
-
[10]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv.org, 2023. 1, 2, 5
work page 2023
-
[11]
Truthfulqa: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Association for Computational Linguistics (ACL), 2022. 2
work page 2022
-
[12]
Aligning large multi-modal model with robust instruction tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv.org, 2023. 1, 2
work page 2023
-
[13]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv,
-
[14]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv.org, 2023. 1
work page 2023
-
[15]
Interactive learning from policy-dependent human feedback
James MacGlashan, Mark K. Ho, Robert Tyler Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive learning from policy-dependent human feedback. In Proceedings of the International Conference on Machine learning (ICML), pages 2285–2294. PMLR, 2017. 3
work page 2017
-
[16]
Sources of hallucination by large language models on inference tasks
Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources of hallucination by large language models on inference tasks. arXiv.org, 2023. 2
work page 2023
-
[17]
Deep reinforcement learning: An overview
Seyed Sajad Mousavi, Michael Schukat, and Enda Howley. Deep reinforcement learning: An overview. In Intelligent Systems Conference (IntelliSys), 2016. 2
work page 2016
- [18]
-
[19]
Ian Osband, Benjamin Van Roy, Daniel J. Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research (JMLR), 2019. 2
work page 2019
-
[20]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...
work page 2022
-
[21]
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data, and web data only. arXiv.org,
-
[22]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv.org, 2023. 2, 3, 5
work page 2023
-
[23]
Investigating the factual knowledge boundary of large language models with retrieval augmentation
Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hao Tian, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Investigating the factual knowledge boundary of large language models with retrieval augmentation. arXiv.org,
-
[24]
Object hallucination in image captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Empirical Methods in Natural Language Processing (EMNLP), 2018. 2
work page 2018
-
[25]
Proximal policy optimization algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv.org, 2017. 3
work page 2017
-
[26]
Aligning large multimodal models with factually augmented RLHF
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. arXiv.org, 2023. 1, 2
work page 2023
-
[27]
Llama 2: Open foundation and fine-tuned chat models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Antho...
work page 2023
-
[28]
Vigc: Visual instruction generation and correction
Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, and Conghui He. VIGC: visual instruction generation and correction. arXiv.org, abs/2308.12714, 2023. 2
-
[29]
Evaluation and analysis of hallucination in large vision-language models
Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, and Haoyu Tang. Evaluation and analysis of hallucination in large vision-language models. arXiv.org, 2023. 2
work page 2023
-
[30]
Woodpecker: Hallucination correction for multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. arXiv.org, 2023. 2
work page 2023
-
[31]
Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, and Xiangjun Fan. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv.org, 2023. 1
work page 2023
-
[32]
Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and c...
work page 2023
-
[33]
Siren’s song in the AI ocean: A survey on hallucination in large language models
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv.org, 2023. 2
work page 2023
-
[34]
LIMA: less is more for alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. LIMA: less is more for alignment. arXiv.org,
-
[35]
Analyzing and mitigating object hallucination in large vision-language models
Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv.org, 2023. 1, 2
work page 2023
-
[36]
Minigpt-4: Enhancing vision-language understanding with advanced large language models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv.org, 2023. 1, 6, 7
work page 2023
-
[37]
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv.org, abs/1909.08593, 2019. 3
work page internal anchor Pith review Pith/arXiv arXiv 1909
Supplementary Material
-
[38]
Dataset
Visual Genome (VG). Visual Genome is a large-scale vision-language dataset that includes dense captions [9]. It contains over 100,000 densely annotated images, each averaging 21 objects, 18 attributes, and 18 object relationships. As the largest and densest dataset of image descriptions, objects, attributes, relationships, and question answers...
-
[39]
Details of Hallucination Data Generation
In this section, we delve into the specifics of the hallucination data generation process. This is illustrated through concrete examples at each of its three crucial stages: description generation, hallucination detection and correction, and style-consistency data augmentation.
8.1. Description Generation
We pr...
-
[40]
Details of SHR evaluation
9.1. Evaluation Example
In the SHR evaluation, GPT-4 classifies each sentence in the model response as hallucination or correct. The SHR is then computed as the ratio of hallucinated sentences to total sentences. See Figure 12 for an illustration.
9.2. Factual-assisted Evaluation
During the SHR evaluation, some inaccuracies...
-
[41]
Additional Style-consistency Analysis
To further demonstrate the effect of style-consistency control, we quantitatively examine its role in preventing training instability. Specifically, the MiniGPT4-LLaMA2 model is fine-tuned with and without style-consistency control and evaluated on a subset of SHR. Instability is eva...
-
[42]
Comparison with other hallucination mitigation methods
In this section, we compare our proposed method with other hallucination mitigation methods; results are shown in Table 7. Results show that HA-DPO outperforms other competitive methods and achieves SOTA (state-of-the-art) in POPE accuracy. Notably, LRV used 400,000 training data, while our hal...
-
[43]
Quality Examples
Figure 13, Figure 14 and Figure 15 present hallucination-eliminated examples generated by the MiniGPT4-LLaMA2-7B, InstructBLIP-13 and LLaVA-1.5, respectively. Models optimized using HA-DPO produce significantly less hallucinated content in both visual question-answering and image description tasks.
12.1. Efficacy and Potential of HA-...
-
Figure 9. Hallucination Detection and Correction in LVLM-Generated Sentences.
Prompt:
Here are the region descriptions of the image: [0.02, 0.02, 0.98, 0.65]: people are posing for the picture [0.81, 0.0, 0.94, 0.2]: the woman is wearing goggles [0.79, 0.0, 0.94, 0.22]: the woman is wearing a scarf ...
Here is the comment for you to judge if it is hallucination and revise:
...
- One person is standing at the top of the slope, holding a ski pole, while the others are skiing down the slope
- The slope is covered in snow and there are trees in the background
- The sky is blue and there are some clouds in the distance
...
GPT-4 Output:
- The image shows a group of people skiing down a snowy slope. (correct: The region descriptions mention four skiers on a snowy hill, and objects related to skiing.)
- They are all wearing ski gear, including helmets, goggles, and skis. (correct: The region descriptions include ski gear like helmets, goggles, ski poles and skis.)
- One person is standing at the top of the slope, holding a ski pole, while the others are skiing down the slope. (hallucination: There is no description mentioning one person standing at the top of the slope while the others are skiing down.)
- The slope is covered in snow and there are trees in the background. (hallucination: There is no description involving trees in the background and the slope being covered in snow.)
- The sky is blue and there are some clouds in the distance. (hallucination: There is no description involving the sky or clouds.)
- The overall mood of the image is one of excitement and adventure. (correct: The overall mood is adventurous.)
Corrected Sentences:
- The image shows a group of people skiing down a snowy slope
- They are all wearing ski gear, including helmets, goggles, and skis
...
- The image portrays a winter sport scenario, with skiers ready for an adventure.
-
Prompt:
Help me rewrite the given sentence. Don't change any detail and information in the original sentence. Don't add any new information. Give 3 different versions. The format of you...
-
[65]
They are all wearing ski gear, including helmets, goggles, and skis
The sentence you need to rewrite: The image shows a group of people skiing down a snowy slope. They are all wearing ski gear, including helmets, goggles, and skis. One person is standing at the top of the slope, holding a ski pole, while the others are skiing down the slope. The slope is covered in snow and there are trees in the background. The sky is bl...
-
[66]
Each person is outfitted in ski gear, complete with helmets, goggles, and skis
In the image, a snowy slope is depicted with a group of individuals skiing down it. Each person is outfitted in ski gear, complete with helmets, goggles, and skis. Positioned at the top of the slope, one individual stands, clutching a ski pole, while the rest joyfully descend. Snow blankets the slope, accompanied by a backdrop of trees. The sky above is a...
-
[67]
All members are donned in ski gear, incorporating helmets, goggles, and skis into their attire
A snowy slope takes center stage in the image, featuring a spirited group of skiers descending down its snowy expanse. All members are donned in ski gear, incorporating helmets, goggles, and skis into their attire. One person stands confidently at the peak, gripping a ski pole, while the remaining skiers revel in the thrill of their descent. The snow-cove...
-
[68]
Illustrated in the image is a snowy slope, where a cohesive group of skiing enthusiasts can be seen carving their way down. Their attire includes helmets, goggles, and skis, highlighting their readiness for an exhilarating adventure. Positioned at the summit of the slope, one individual stands upright, firmly grasping a ski pole, while the others relish i...
-
[69]
The sentence you need to rewrite: The image shows a group of people skiing down a snowy slope. They are all wearing ski gear, including helmets, goggles, and skis. A woman in ski gear is smiling at the camera, holding ski poles. Four skiers are present on a snowy hill. The skiers are equipped with helmets, goggles and ski poles. The image portrays a winte...
-
[70]
In the picture, there is a snowy hill with a group of four skiers all dressed in ski gear, including helmets, goggles, and skis. One woman in the group is smiling at the camera while holding ski poles
-
[71]
The snowy slope depicted in the image is occupied by a group of four skiers who are all wearing ski gear, including helmets, goggles, and ski poles. One of the skiers, a woman, can be seen smiling at the camera
-
[72]
The image showcases a winter sport scene featuring four skiers skiing down a snowy hill. All of them are properly equipped with ski gear, such as helmets, goggles, and ski poles. Among the skiers, a woman stands out as she smiles at the camera while holding her ski poles.
[Panel labels: Prompt, GPT-4 Output]
Figure 11. Style-Consistent Data Augmentation: Positive Sample (...
-
[73]
hallucination or correct or cannot judge: <reason>
-
[74]
Here is the comment for you to judge if it is hallucination:
Here are the region descriptions of the image:
[0.24, 0.59, 0.34, 0.95]: woman walking on the street
[0.46, 0.59, 0.62, 0.68]: people sitting on tables on the street
[0.41, 0.57, 0.46, 0.68]: man wearing red top
[0.1, 0.0, 0.36, 0.43]: a balcony in the building
[0.36, 0.58, 0.43, 0.75]: woman holding a white bag
...
Here is the comment for you to judge if...
-
[75]
The scene depicts a bustling city street filled with pedestrians, motorcycles, and bicycles
-
[76]
There are several people walking along the sidewalk, including a woman in a white dress who appears to be crossing the street
-
[77]
In addition to the pedestrians, there are two motorcycles parked on the side of the street, one closer to the left side and the other closer to the right side
-
[78]
Several bicycles can also be seen throughout the scene, some parked and others being ridden by individuals.
Judgement:
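The visible pieces of this judging prompt (region descriptions, the comment sentences, and the closing "Judgement:" cue) can be assembled programmatically. The following is only a minimal sketch, not the authors' code: the helper names `format_region` and `build_judge_prompt` are hypothetical, the instruction preamble that the source elides is deliberately left out, and the four numbers per region are passed through as given without assuming a particular coordinate convention.

```python
from typing import List, Tuple

# Four numbers per region, copied as-is from the annotation (convention not assumed).
Box = Tuple[float, float, float, float]


def format_region(box: Box, description: str) -> str:
    """Render one region as '[a, b, c, d]: description', matching the prompt layout."""
    coords = ", ".join(str(v) for v in box)
    return f"[{coords}]: {description}"


def build_judge_prompt(regions: List[Tuple[Box, str]], sentences: List[str]) -> str:
    """Join the prompt fragments visible in the source (hypothetical helper).

    The task instructions that precede these fragments in the paper's prompt
    are elided in the source and are therefore not reproduced here.
    """
    lines = ["Here are the region descriptions of the image:"]
    lines += [format_region(box, text) for box, text in regions]
    lines.append("Here is the comment for you to judge if it is hallucination:")
    lines += sentences
    lines.append("Judgement:")
    return "\n".join(lines)


example_regions = [
    ((0.24, 0.59, 0.34, 0.95), "woman walking on the street"),
    ((0.1, 0.0, 0.36, 0.43), "a balcony in the building"),
]
example_sentences = [
    "The scene depicts a bustling city street filled with pedestrians, "
    "motorcycles, and bicycles.",
]
example_prompt = build_judge_prompt(example_regions, example_sentences)
```

Keeping region formatting in its own helper makes it easy to swap in a different coordinate rendering if the annotation source uses one.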
-
[79]
The scene depicts a bustling city street filled with pedestrians, motorcycles, and bicycles. (Hallucination: The description does not mention bicycles or the street being bustling.)
-
[80]
There are several people walking along the sidewalk, including a woman in a white dress who appears to be crossing the street. (Correct: The description mentions a woman walking and other people on the sidewalk.)
-
[81]
In addition to the pedestrians, there are two motorcycles parked on the side of the street, one closer to the left side and the other closer to the right side. (Correct: The description mentions two motorcycles parked on the side of the street.)
-
[82]
Several bicycles can also be seen throughout the scene, some parked and others being ridden by individuals. (Hallucination: There are no bicycles mentioned in the description.)
[Panel labels: GPT-4 Output, Prompt]
Figure 12. Illustration of Sentence Hallucination Ratio (SHR) Evaluation.
The image shows a plate of food on a table in front of a computer monitor. The plat...
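The per-sentence judgements in the Figure 12 example can be tallied into a single ratio. The sketch below is not taken from the paper. It parses lines of the form "sentence (Label: reason)" and assumes that SHR is the fraction of judged sentences labeled as hallucination, with "cannot judge" lines excluded from the denominator. Both points are assumptions, and `parse_label` and `sentence_hallucination_ratio` are hypothetical names.

```python
import re
from typing import List, Optional

# Matches a trailing "(Label: reason)" annotation; tolerates spaces inside the parentheses.
LABEL_RE = re.compile(
    r"\(\s*(hallucination|correct|cannot judge)\s*:(.*)\)\s*$",
    re.IGNORECASE,
)


def parse_label(line: str) -> Optional[str]:
    """Return the lower-cased judgement label of an annotated sentence, or None."""
    match = LABEL_RE.search(line.strip())
    return match.group(1).lower() if match else None


def sentence_hallucination_ratio(lines: List[str]) -> float:
    """Fraction of judged sentences labeled 'hallucination'.

    Assumption: lines labeled 'cannot judge' or carrying no label are skipped.
    Returns 0.0 when nothing was judged.
    """
    labels = [parse_label(line) for line in lines]
    judged = [label for label in labels if label in ("hallucination", "correct")]
    if not judged:
        return 0.0
    return judged.count("hallucination") / len(judged)


# The four judged sentences from the Figure 12 example.
figure_12_lines = [
    "The scene depicts a bustling city street filled with pedestrians, motorcycles, "
    "and bicycles. ( Hallucination: The description does not mention bicycles or "
    "the street being bustling. )",
    "There are several people walking along the sidewalk, including a woman in a "
    "white dress who appears to be crossing the street. (Correct: The description "
    "mentions a woman walking and other people on the sidewalk.)",
    "In addition to the pedestrians, there are two motorcycles parked on the side of "
    "the street, one closer to the left side and the other closer to the right side. "
    "(Correct: The description mentions two motorcycles parked on the side of the street.)",
    "Several bicycles can also be seen throughout the scene, some parked and others "
    "being ridden by individuals. (Hallucination: There are no bicycles mentioned in "
    "the description. )",
]

ratio = sentence_hallucination_ratio(figure_12_lines)  # 2 of 4 judged sentences -> 0.5
```

Under these assumptions the example response scores 0.5, since two of its four judged sentences are marked as hallucinations.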