arxiv: 2311.16839 · v2 · pith:GDV46QHW · submitted 2023-11-28 · 💻 cs.CV · cs.CL

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Pith reviewed 2026-05-17 10:48 UTC · model grok-4.3

keywords hallucination · direct preference optimization · multimodal large language models · vision-language models · preference learning · image captioning · model alignment

The pith

HA-DPO trains multimodal models to prefer accurate image descriptions over hallucinatory ones by optimizing on paired responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hallucination-Aware Direct Preference Optimization (HA-DPO) to reduce hallucinations in large vision-language models. It recasts the problem as a preference task: given two responses to the same image, one accurate and one hallucinatory, the model learns to prefer the accurate one. The authors also describe a pipeline that builds style-consistent positive and negative sample pairs for training. Applied to three models, the method lowers hallucination rates and improves generalization; MiniGPT-4, for example, rises from 51.13% to 86.13% accuracy on the POPE benchmark. Readers care because reliable image understanding matters for applications that depend on these models.

Core claim

By reframing hallucination mitigation as a direct preference optimization task and supplying an efficient pipeline for style-consistent positive and negative response pairs, HA-DPO trains models to favor accurate descriptions, which reduces hallucination rates and raises performance on downstream metrics such as POPE accuracy and MME scores across multiple multimodal architectures.

What carries the argument

Hallucination-Aware Direct Preference Optimization (HA-DPO), a preference-learning method that uses paired accurate and hallucinatory responses for the same image to steer the model toward non-hallucinating outputs.
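
To make the mechanism concrete, here is a minimal sketch of the per-pair DPO loss that a method like HA-DPO would minimize. It is an illustration under stated assumptions, not the authors' implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical, and the inputs are assumed to be summed log-probabilities of each full response given the image and prompt.

```python
import math


def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Per-pair DPO loss (illustrative sketch, names are hypothetical).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model. "pos" is the
    accurate response, "neg" is the hallucinatory one.
    """
    # How much more the policy favors each response than the reference does.
    pos_shift = policy_logp_pos - ref_logp_pos
    neg_shift = policy_logp_neg - ref_logp_neg
    margin = beta * (pos_shift - neg_shift)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss sits at log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the accurate response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss falls only when the policy raises the accurate response relative to the hallucinatory one, measured against the reference model. This is why the quality of the pairs matters so much for what the model actually learns.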

If this is right

  • Models fine-tuned with HA-DPO exhibit lower hallucination rates on standard benchmarks such as POPE.
  • The same training yields higher overall scores on comprehensive evaluation suites like MME.
  • The approach transfers across three different mainstream multimodal models with consistent benefits.
  • The constructed preference dataset supports robust learning without requiring changes to the base model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The preference-pair construction pipeline could be reused to align models on other grounding tasks such as visual question answering with factual constraints.
  • If the gains hold on out-of-distribution images, the method might reduce the need for post-hoc hallucination detection modules in deployed systems.
  • Similar preference optimization could be tested on text-only models by generating synthetic contradictory response pairs.

Load-bearing premise

The pipeline produces positive and negative sample pairs that remain high-quality, style-consistent, and free of new biases capable of undermining preference learning.

What would settle it

Retrain the model on preference pairs deliberately constructed with mismatched styles or added factual errors, evaluate it on a new set of images, and check whether the reported accuracy gains on POPE and MME disappear.

Original abstract

Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem", in which the models generate textual descriptions that inaccurately depict or entirely fabricate content from associated images. This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task. The model is trained to favor the non-hallucinating response when presented with two responses of the same image (one accurate and one hallucinatory). Furthermore, this paper proposes an efficient pipeline for constructing positive (non-hallucinatory) and negative (hallucinatory) sample pairs, ensuring a high-quality, style-consistent dataset for robust preference learning. When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and amplified the models' generalization capabilities. Notably, the MiniGPT-4 model, when enhanced with HA-DPO, demonstrated a substantial improvement: POPE accuracy rose from 51.13% to 86.13% (an absolute improvement of 35%), and the MME score surged from 932.00 to 1326.46 (a relative improvement of 42.32%). The codes, models, and datasets are made accessible at https://opendatalab.github.io/HA-DPO.
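
The two headline figures in the abstract use different kinds of gain: the POPE number is an absolute difference (35 percentage points), while the MME number is a relative change. The short check below recomputes both from the values quoted in the abstract. The helper names are hypothetical and exist only for this illustration.

```python
def absolute_gain(before, after):
    """Difference in the metric's own units (percentage points for accuracy)."""
    return after - before


def relative_gain_pct(before, after):
    """Change expressed as a percentage of the starting value."""
    return 100.0 * (after - before) / before


# Values as quoted in the abstract for MiniGPT-4.
pope_before, pope_after = 51.13, 86.13
mme_before, mme_after = 932.00, 1326.46

print(f"POPE absolute gain: {absolute_gain(pope_before, pope_after):.2f} points")  # 35.00
print(f"MME relative gain: {relative_gain_pct(mme_before, mme_after):.2f}%")       # 42.32
```

Both results agree with the abstract, provided "35%" is read as percentage points and not as a relative change.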

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Hallucination-Aware Direct Preference Optimization (HA-DPO) to address hallucinations in Large Vision-Language Models (LVLMs). It reframes the problem as a preference selection task, training the model to prefer accurate (non-hallucinatory) responses over hallucinatory ones for the same image. An efficient pipeline is introduced to construct high-quality, style-consistent positive and negative sample pairs for robust preference learning. Experiments apply HA-DPO to three mainstream multimodal models and report large gains on POPE and MME benchmarks, including MiniGPT-4 improving from 51.13% to 86.13% on POPE and from 932.00 to 1326.46 on MME. Code, models, and datasets are released.

Significance. If the gains arise from genuine learning of visual grounding rather than artifacts in pair construction, the approach could provide a practical and scalable way to reduce hallucinations via preference optimization. The open release of code, models, and datasets is a clear strength that supports reproducibility and further work. The empirical deltas on established benchmarks are notable, but their interpretation depends on confirming that the method does not exploit surface cues in the constructed pairs.

major comments (2)
  1. [Method section (pair-construction pipeline)] The headline performance claims (e.g., MiniGPT-4 POPE +35 pp, MME +42%) rest on the pipeline producing style-consistent, bias-free positive/negative pairs. The description of this 'efficient pipeline' provides no ablation or control experiment that isolates style-matching, response length, hedging language, or factual density differences between positives and negatives. Without such controls, it remains possible that DPO optimizes for these surface statistics rather than visual grounding, which would undermine the generalization interpretation of the POPE and MME results.
  2. [Experiments section] No experiment is reported that compares HA-DPO against a baseline using randomly generated or deliberately mismatched negative responses. Such an ablation would directly test whether the reported improvements require the claimed high-quality, style-consistent pairs or could be obtained more cheaply.
minor comments (2)
  1. Clarify the exact three models used beyond the MiniGPT-4 example and report per-model results in a single table for easier comparison.
  2. The abstract and results mention DPO hyperparameters only implicitly; add a short paragraph or table listing the chosen beta, learning rate, and any selection procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and agree that additional controls would help strengthen the interpretation of our results. We plan to incorporate the suggested experiments in the revised version.

Point-by-point responses
  1. Referee: [Method section (pair-construction pipeline)] The headline performance claims (e.g., MiniGPT-4 POPE +35 pp, MME +42%) rest on the pipeline producing style-consistent, bias-free positive/negative pairs. The description of this 'efficient pipeline' provides no ablation or control experiment that isolates style-matching, response length, hedging language, or factual density differences between positives and negatives. Without such controls, it remains possible that DPO optimizes for these surface statistics rather than visual grounding, which would undermine the generalization interpretation of the POPE and MME results.

    Authors: We acknowledge the validity of this concern. While the pipeline generates positive and negative pairs from the same LVLM using accuracy-targeted prompts and consistent instruction templates to promote stylistic similarity, the current manuscript does not include explicit ablations for length, hedging, or factual density. In the revised manuscript we will add controls that (i) truncate responses to matched lengths, (ii) construct style-mismatched pairs, and (iii) quantify hedging language differences, to test whether gains persist beyond surface cues. revision: yes
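
Controls (i) and (iii) in this response could be prototyped along the following lines. This is a hedged sketch: the hedge-word list, the whitespace tokenizer, and all function names are assumptions made for illustration, and the example pairs are invented, not drawn from the paper's dataset.

```python
# Hypothetical hedge-word list; a real audit would need a vetted lexicon.
HEDGE_WORDS = {"may", "might", "possibly", "appears", "seems", "likely", "perhaps"}


def tokens(text):
    """Very rough tokenizer: lowercase, drop commas and periods, split on spaces."""
    return text.lower().replace(",", " ").replace(".", " ").split()


def hedge_count(text):
    return sum(1 for t in tokens(text) if t in HEDGE_WORDS)


def truncate_to_matched_length(positive, negative):
    """Control (i): cut both responses to the shorter one's word count."""
    n = min(len(positive.split()), len(negative.split()))
    return " ".join(positive.split()[:n]), " ".join(negative.split()[:n])


def mean_surface_gaps(pairs):
    """Control (iii): mean (positive minus negative) gap in length and hedging."""
    length_gaps = [len(tokens(p)) - len(tokens(n)) for p, n in pairs]
    hedge_gaps = [hedge_count(p) - hedge_count(n) for p, n in pairs]
    return sum(length_gaps) / len(pairs), sum(hedge_gaps) / len(pairs)


# Invented (positive, negative) pairs, for illustration only.
pairs = [
    ("A dog sits on a red couch.", "A dog and a cat sit on a red couch near a lamp."),
    ("The street appears empty.", "The street is full of cars."),
]
print(mean_surface_gaps(pairs))
print(truncate_to_matched_length(*pairs[0]))
```

Mean gaps far from zero would suggest that length or hedging alone could separate positives from negatives, which is the surface cue the referee is worried about.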

  2. Referee: [Experiments section] No experiment is reported that compares HA-DPO against a baseline using randomly generated or deliberately mismatched negative responses. Such an ablation would directly test whether the reported improvements require the claimed high-quality, style-consistent pairs or could be obtained more cheaply.

    Authors: We agree that a direct comparison to random or mismatched negatives is a useful control. In the revised manuscript we will report results from a baseline that uses randomly sampled or deliberately mismatched negative responses (e.g., generic incorrect captions or shuffled pairs) and show that performance is substantially lower than with our style-consistent pairs, thereby supporting the value of the proposed pipeline. revision: yes
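
One way to build the "shuffled pairs" control mentioned in this response is to give every sample a rejected response taken from a different sample. The sketch below assumes a hypothetical record layout with `image`, `chosen`, and `rejected` fields; the released dataset may use different names and structure.

```python
import random


def shuffled_negative_control(samples, seed=0):
    """Return copies of the samples whose rejected text comes from another sample.

    The field names are hypothetical. Walking a shuffled index cycle guarantees
    that no sample keeps its own rejected response.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to mismatch negatives")
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    control = [None] * len(samples)
    for position, index in enumerate(order):
        donor = order[(position + 1) % len(order)]  # next index in the cycle
        control[index] = {
            "image": samples[index]["image"],
            "chosen": samples[index]["chosen"],
            "rejected": samples[donor]["rejected"],
        }
    return control


# Invented records, for illustration only.
samples = [
    {"image": "img0.jpg", "chosen": "Two skiers on a slope.", "rejected": "Five skiers and a dog."},
    {"image": "img1.jpg", "chosen": "A bowl of fruit.", "rejected": "A bowl of fruit and a knife."},
    {"image": "img2.jpg", "chosen": "An empty street.", "rejected": "A street full of cars."},
    {"image": "img3.jpg", "chosen": "A cat on a sofa.", "rejected": "A cat and a bird on a sofa."},
]
control = shuffled_negative_control(samples, seed=0)
print(control[0])
```

Training on `control` in place of the original pairs, with everything else held fixed, would show how much of the gain depends on negatives that are matched to the image.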

Circularity Check

0 steps flagged

No circularity: empirical gains on external benchmarks

Full rationale

The paper introduces HA-DPO as a reframing of hallucination mitigation into a preference optimization task using constructed positive/negative response pairs, then reports measured improvements on held-out benchmarks (POPE, MME) after fine-tuning three models. No equations, derivations, or first-principles claims appear in the provided text; the central results are performance deltas obtained via standard training and evaluation rather than quantities forced by construction from fitted inputs or self-citations. The method relies on an external DPO framework and on benchmark metrics that are independent of the training pairs, so the evaluation is not circular with respect to the training data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard DPO loss formulation plus the domain assumption that preference pairs can be generated at scale without introducing confounding style or content biases; no new physical entities or ad-hoc constants are introduced.
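
For reference, the standard DPO objective (Rafailov et al., 2023) can be written as below, which shows where the free parameter beta enters. The notation is an assumption of this note and may differ from the paper's: x stands for the image together with the prompt, y_w for the non-hallucinatory response, y_l for the hallucinatory one, and pi_ref for the frozen reference model.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

A larger beta penalizes deviation from the reference model more strongly, so the policy stays closer to it. This is why beta and the learning rate are the natural candidates for the ledger's free-parameter entry.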

free parameters (1)
  • DPO beta and learning-rate hyperparameters
    Standard preference-optimization hyperparameters that are typically tuned on validation data.
axioms (1)
  • domain assumption: Direct preference optimization loss can be applied to vision-language model outputs to reduce hallucinations when positive and negative pairs are available.
    Invoked when the method is presented as a drop-in reframing of the hallucination problem.

pith-pipeline@v0.9.0 · 5564 in / 1411 out tokens · 64153 ms · 2026-05-17T10:48:14.256416+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    AVES-DPO mitigates hallucinations in LVLMs by creating in-distribution preference pairs through the model's self-correction, outperforming baselines with only 5.2k samples.

  2. Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

  3. Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

    cs.CL 2026-04 unverdicted novelty 7.0

    DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.

  4. Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.

  5. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  6. Deep Pre-Alignment for VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

  7. Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

    cs.CV 2026-05 unverdicted novelty 6.0

    A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.

  8. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  9. Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with neglig...

  10. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  11. Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

    cs.AI 2026-02 unverdicted novelty 6.0

    REVIS reduces object hallucination in large vision-language models by about 19% via sparse orthogonal projection in latent space at suppression depths while keeping reasoning intact.

  12. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV 2025-03 conditional novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  13. Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.

  14. Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation

    cs.IR 2025-11 unverdicted novelty 5.0

    HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.

  15. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  16. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  17. Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

    cs.LG 2024-02 unverdicted novelty 5.0

    POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

  18. InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    cs.CV 2024-01 unverdicted novelty 5.0

    InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

  19. Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

    cs.CV 2025-06 unverdicted novelty 4.0

    Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.

  20. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 20 Pith papers · 1 internal anchor

  1. [1]

    Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Ka- davath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tris- tan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olss...

  2. [2]

    The exploration-exploitation dilemma: a multidisci- plinary framework

    Oded Berger-Tal, Jonathan Nathan, Ehud Meron, and David Saltz. The exploration-exploitation dilemma: a multidisci- plinary framework. PLOS ONE, 2014. 2

  3. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz...

  4. [4]

    Christiano, Jan Leike, Tom B

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NIPS), 2017. 2

  5. [5]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arXiv.org, 2023. 1, 2, 7

  6. [6]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv.org, 2023. 8

  7. [7]

    Detecting and preventing hallucinations in large vision language models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv.org, 2023. 1

  8. [8]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InPro- ceedings of the International Conference on Learning Rep- resentations (ICLR). OpenReview.net, 2022. 6

  9. [9]

    Shamma, Michael S

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 2017. 4, 6, 1

  10. [10]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv.org, 2023. 1, 2, 5

  11. [11]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Associ- ation for Computational Linguistics (ACL), 2022. 2

  12. [12]

    Aligning large multi-modal model with robust instruction tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Ya- coob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv.org, 2023. 1, 2

  13. [13]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv,

  14. [14]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv.org, 2023. 1

  15. [15]

    Ho, Robert Tyler Loftin, Bei Peng, Guan Wang, David L

    James MacGlashan, Mark K. Ho, Robert Tyler Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive learning from policy- dependent human feedback. In Proceedings of the Inter- national Conference on Machine learning (ICML) , pages 2285–2294. PMLR, 2017. 3

  16. [16]

    Sources of hallucination by large language models on inference tasks

    Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources of hallucination by large language models on inference tasks. arXiv.org, 2023. 2

  17. [17]

    Deep reinforcement learning: An overview

    Seyed Sajad Mousavi, Michael Schukat, and Enda Howley. Deep reinforcement learning: An overview. In Intelligent Systems Conference (IntelliSys), 2016. 2

  18. [18]

    GPT-4 technical report

    OpenAI. GPT-4 technical report. arXiv.org, 2023. 2, 3

  19. [19]

    Russo, and Zheng Wen

    Ian Osband, Benjamin Van Roy, Daniel J. Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research (JMLR), 2019. 2

  20. [20]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

  21. [21]

    The refinedweb dataset for falcon LLM: outperforming cu- rated corpora with web data, and web data only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon LLM: outperforming cu- rated corpora with web data, and web data only. arXiv.org,

  22. [22]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Er- mon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv.org, 2023. 2, 3, 5

  23. [23]

    Investigating the factual knowledge boundary of large language models with retrieval augmentation

    Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hao Tian, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Investigating the factual knowledge boundary of large language models with retrieval augmentation. arXiv.org,

  24. [24]

    Object hallucination in image cap- tioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image cap- tioning. In Empirical Methods in Natural Language Process- ing (EMNLP), 2018. 2

  25. [25]

    Proximal policy optimization algo- rithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms. arXiv.org, 2017. 3 9

  26. [26]

    Aligning large multimodal models with factually augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. arXiv.org, 2023. 1, 2

  27. [27]

    Llama 2: Open foundation and fine- tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- bert, Amjad Almahairi, Yasmine Babaei, Nikolay Bash- lykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fer- nandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Antho...

  28. [28]

    Vigc: Visual instruction generation and correction

    Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, and Conghui He. VIGC: visual instruction generation and cor- rection. arXiv.org, abs/2308.12714, 2023. 2

  29. [29]

    Evaluation and analysis of hallucination in large vision-language mod- els

    Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, and Haoyu Tang. Evaluation and analysis of hallucination in large vision-language mod- els. arXiv.org, 2023. 2

  30. [30]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. arXiv.org, 2023. 2

  31. [31]

    Halle-switch: Rethinking and con- trolling object existence hallucinations in large vision lan- guage models for detailed caption

    Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, and Xiangjun Fan. Halle-switch: Rethinking and con- trolling object existence hallucinations in large vision lan- guage models for detailed caption. arXiv.org, 2023. 1

  32. [32]

    Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

    Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Con- ghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and c...

  33. [33]

    Siren’s song in the AI ocean: A survey on hallucination in large language models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv.org, 2023. 2

  34. [34]

    LIMA: less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. LIMA: less is more for alignment. arXiv.org,

  35. [35]

    Analyzing and mitigating object hallucination in large vision-language models

    Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv.org, 2023. 1, 2

  36. [36]

    Minigpt-4: Enhancing vision- language understanding with advanced large language mod- els

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision- language understanding with advanced large language mod- els. arXiv.org, 2023. 1, 6, 7

  37. [37]

    Fine-Tuning Language Models from Human Preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv.org, abs/1909.08593, 2019. 3 10 Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization Supplementary Material

  38. [38]

    Visual Genome is a large-scale vision-language dataset that includes dense captions [9]

    Dataset Visual Genome (VG). Visual Genome is a large-scale vision-language dataset that includes dense captions [9]. It contains over 100,000 densely annotated images, each aver- aging 21 objects, 18 attributes, and 18 object relationships. As the largest and densest dataset of image descriptions, ob- jects, attributes, relationships, and question answers...

  39. [39]

    Describe the image in detail

    Details of Hallucination Data Generation In this section, we delve into the specifics of the halluci- nation data generation process. This is illustrated through concrete examples at each of its three crucial stages: de- scription generation, hallucination detection and correction, and style-consistency data augmentation. 8.1. Description Generation We pr...

  40. [40]

    Evaluation Example In the SHR evaluation, GPT-4 classifies each sentence in the model response as hallucination or correct

    Details of SHR evaluation 9.1. Evaluation Example In the SHR evaluation, GPT-4 classifies each sentence in the model response as hallucination or correct. The SHR is then computed as the proportion of hallucinated sentences to total sentences. Consult Figure 12 for illustration. 9.2. Factual-assisted Evaluation During the SHR evaluation, some inaccuracies...

  41. [41]

    Specif- ically, MiniGPT4-LLaMA2 model is fine-tuned with and without style-consistency control and evaluated on a sub- set of SHR

    Additional Style-consistency Analysis To further demonstrate the effect of style-consistent con- trol, we quantitatively examine the role of data style- consistent control in preventing training instability. Specif- ically, MiniGPT4-LLaMA2 model is fine-tuned with and without style-consistency control and evaluated on a sub- set of SHR. Instability is eva...

  42. [42]

    Results show that HA-DPO outperforms other competitive methods and achieves SOTA (state-of-the-art) in POPE accuracy

    Comparison with other hallucination mitigation methods In this section, we compare our proposed method with other hallucination mitigation methods; results are shown in Table 7. Results show that HA-DPO outperforms other competitive methods and achieves SOTA (state-of-the-art) in POPE accuracy. Notably, LRV used 400,000 training data, while our hal...

  43. [43]

    Cannot judge

    Quality Examples Figure 13, Figure 14 and Figure 15 present hallucination-eliminated examples generated by MiniGPT4-LLaMA2-7B, InstructBLIP-13 and LLaVA-1.5, respectively. Models optimized using HA-DPO produce significantly less hallucinated content in both visual question-answering and image description tasks. 12.1. Efficacy and Potential of HA-...

  44. [45]

    Corrected Sentences:

  45. [46]

    Here is the comment for you to judge if it is hallucination and revise:

    Here are the region descriptions of the image: [0.02, 0.02, 0.98, 0.65]: people are posing for the picture [0.81, 0.0, 0.94, 0.2]: the woman is wearing goggles [0.79, 0.0, 0.94, 0.22]: the woman is wearing a scarf ... Here is the comment for you to judge if it is hallucination and revise:
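    The prompt layout shown above (bounding box, colon, region description, then the comment to judge) could be assembled as in the following sketch. The helper names `format_region_descriptions` and `build_detection_prompt` are hypothetical, the line breaks between parts are an assumption, and the boxes are taken as already normalized to [0, 1] because the excerpt does not state the coordinate convention.

```python
def format_region_descriptions(regions):
    """Render (box, description) pairs as '[a, b, c, d]: description' lines.

    Assumption: each box holds four floats already normalized to [0, 1].
    """
    lines = ["Here are the region descriptions of the image:"]
    for box, description in regions:
        coords = ", ".join(str(round(float(value), 2)) for value in box)
        lines.append(f"[{coords}]: {description}")
    return "\n".join(lines)


def build_detection_prompt(regions, comment):
    """Join the region block with the comment that the judge should assess."""
    return "\n".join(
        [
            format_region_descriptions(regions),
            "Here is the comment for you to judge if it is hallucination and revise:",
            comment,
        ]
    )


# Region annotations copied from the example above.
regions = [
    ((0.02, 0.02, 0.98, 0.65), "people are posing for the picture"),
    ((0.81, 0.0, 0.94, 0.2), "the woman is wearing goggles"),
]
print(build_detection_prompt(regions, "The sky is blue and there are some clouds in the distance."))
```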

  46. [49]

    One person is standing at the top of the slope, holding a ski pole, while the others are skiing down the slope

  47. [50]

    The slope is covered in snow and there are trees in the background

  48. [51]

    The sky is blue and there are some clouds in the distance

  49. [52]

    Judgement:

    The overall mood of the image is one of excitement and adventure. Judgement:

  50. [53]

    ( correct: The region descriptions mention four skiers on a snowy hill, and objects related to skiing.)

    The image shows a group of people skiing down a snowy slope. ( correct: The region descriptions mention four skiers on a snowy hill, and objects related to skiing.)

  51. [54]

    (correct: The region descriptions include ski gear like helmets, goggles, ski poles and skis.)

    They are all wearing ski gear, including helmets, goggles, and skis. (correct: The region descriptions include ski gear like helmets, goggles, ski poles and skis.)

  52. [55]

    (hallucination: There is no description mentioning one person standing at the top of the slope while the others are skiing down.)

    One person is standing at the top of the slope, holding a ski pole, while the others are skiing down the slope. (hallucination: There is no description mentioning one person standing at the top of the slope while the others are skiing down.)

  53. [56]

    ( hallucination: There is no description involving trees in the background and the slope being covered in snow

    The slope is covered in snow and there are trees in the background. ( hallucination: There is no description involving trees in the background and the slope being covered in snow. )

  54. [57]

    ( hallucination: There is no description involving the sky or clouds.)

    The sky is blue and there are some clouds in the distance. ( hallucination: There is no description involving the sky or clouds.)

  55. [58]

    ( correct: The overall mood is adventurous.) Corrected Sentences:

    The overall mood of the image is one of excitement and adventure. ( correct: The overall mood is adventurous.) Corrected Sentences:

  56. [59]

    The image shows a group of people skiing down a snowy slope

  57. [60]

    They are all wearing ski gear, including helmets, goggles, and skis

  58. [61]

    (Corrected)

    A woman in ski gear is smiling at the camera, holding ski poles. (Corrected)

  59. [62]

    (Corrected)

    Four skiers are present on a snowy hill. (Corrected)

  60. [63]

    (Corrected)

    The skiers are equipped with helmets, goggles and ski poles. (Corrected)

  61. [64]

    GPT-4 Output Prompt Figure 9

    The image portrays a winter sport scenario, with skiers ready for an adventure.

    Figure 9. Hallucination Detection and Correction in LVLM-Generated Sentences.

    Help me rewrite the given sentence. Don't change any detail and information in the original sentence. Don't add any new information. Give 3 different versions. The format of you...

  62. [65]

    They are all wearing ski gear, including helmets, goggles, and skis

    The sentence you need to rewrite: The image shows a group of people skiing down a snowy slope. They are all wearing ski gear, including helmets, goggles, and skis. One person is standing at the top of the slope, holding a ski pole, while the others are skiing down the slope. The slope is covered in snow and there are trees in the background. The sky is bl...

  63. [66]

    Each person is outfitted in ski gear, complete with helmets, goggles, and skis

    In the image, a snowy slope is depicted with a group of individuals skiing down it. Each person is outfitted in ski gear, complete with helmets, goggles, and skis. Positioned at the top of the slope, one individual stands, clutching a ski pole, while the rest joyfully descend. Snow blankets the slope, accompanied by a backdrop of trees. The sky above is a...

  64. [67]

    All members are donned in ski gear, incorporating helmets, goggles, and skis into their attire

    A snowy slope takes center stage in the image, featuring a spirited group of skiers descending down its snowy expanse. All members are donned in ski gear, incorporating helmets, goggles, and skis into their attire. One person stands confidently at the peak, gripping a ski pole, while the remaining skiers revel in the thrill of their descent. The snow-cove...

  65. [68]

    Their attire includes helmets, goggles, and skis, highlighting their readiness for an exhilarating adventure

    Illustrated in the image is a snowy slope, where a cohesive group of skiing enthusiasts can be seen carving their way down. Their attire includes helmets, goggles, and skis, highlighting their readiness for an exhilarating adventure. Positioned at the summit of the slope, one individual stands upright, firmly grasping a ski pole, while the others relish i...

  66. [69]

    They are all wearing ski gear, including helmets, goggles, and skis

    The sentence you need to rewrite: The image shows a group of people skiing down a snowy slope. They are all wearing ski gear, including helmets, goggles, and skis. A woman in ski gear is smiling at the camera, holding ski poles. Four skiers are present on a snowy hill. The skiers are equipped with helmets, goggles and ski poles. The image portrays a winte...

  67. [70]

    One woman in the group is smiling at the camera while holding ski poles

    In the picture, there is a snowy hill with a group of four skiers all dressed in ski gear, including helmets, goggles, and skis. One woman in the group is smiling at the camera while holding ski poles

  68. [71]

    One of the skiers, a woman, can be seen smiling at the camera

    The snowy slope depicted in the image is occupied by a group of four skiers who are all wearing ski gear, including helmets, goggles, and ski poles. One of the skiers, a woman, can be seen smiling at the camera

  69. [72]

    Cannot judge

    The image showcases a winter sport scene featuring four skiers skiing down a snowy hill. All of them are properly equipped with ski gear, such as helmets, goggles, and ski poles. Among the skiers, a woman stands out as she smiles at the camera while holding her ski poles.

    Figure 11. Style-Consistent Data Augmentation: Positive Sample (...

  70. [73]

    hallucination or correct or cannot judge: <reason>

  71. [74]

    Here is the comment for you to judge if it is hallucination:

    Here are the region descriptions of the image: [0.24, 0.59, 0.34, 0.95]: woman walking on the street [0.46, 0.59, 0.62, 0.68]: people sitting on tables on the street [0.41, 0.57, 0.46, 0.68]: man wearing red top [0.1, 0.0, 0.36, 0.43]: a balcony in the building [0.36, 0.58, 0.43, 0.75]: woman holding a white bag ... Here is the comment for you to judge if...

  72. [75]

    The scene depicts a bustling city street filled with pedestrians, motorcycles, and bicycles

  73. [76]

    There are several people walking along the sidewalk, including a woman in a white dress who appears to be crossing the street

  74. [77]

    In addition to the pedestrians, there are two motorcycles parked on the side of the street, one closer to the left side and the other closer to the right side

  75. [78]

    Judgement:

    Several bicycles can also be seen throughout the scene, some parked and others being ridden by individuals. Judgement:

  76. [79]

    ( Hallucination: The description does not mention bicycles or the street being bustling

    The scene depicts a bustling city street filled with pedestrians, motorcycles, and bicycles. ( Hallucination: The description does not mention bicycles or the street being bustling. )

  77. [80]

    (Correct: The description mentions a woman walking and other people on the sidewalk.)

    There are several people walking along the sidewalk, including a woman in a white dress who appears to be crossing the street. (Correct: The description mentions a woman walking and other people on the sidewalk.)

  78. [81]

    (Correct: The description mentions two motorcycles parked on the side of the street.)

    In addition to the pedestrians, there are two motorcycles parked on the side of the street, one closer to the left side and the other closer to the right side. (Correct: The description mentions two motorcycles parked on the side of the street.)

  79. [82]

    (Hallucination: There are no bicycles mentioned in the description

    Several bicycles can also be seen throughout the scene, some parked and others being ridden by individuals. (Hallucination: There are no bicycles mentioned in the description. )

    Figure 12. Illustration of Sentence Hallucination Ratio (SHR) Evaluation.

    The image shows a plate of food on a table in front of a computer monitor. The plat...