pith. machine review for the scientific record.

arxiv: 2604.20696 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords object hallucinations · large vision-language models · chain-of-verification · region-aware processing · training-free methods · multimodal reasoning · hallucination mitigation

The pith

Vision-language models can reduce their own object hallucinations by verifying descriptions of specific image regions they generate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to prompt large vision-language models to extract objects from their initial answers, generate coordinates for those objects in the image, describe the regions, and then verify consistency before producing a final answer. This chain uses the model's own outputs as the cue for checking, rather than external tools or retraining. A reader would care because models often claim objects exist in images when they do not, which breaks trust in tasks such as captioning or visual question answering. If the approach works, it turns the model's tendency to focus on regions into a built-in error-correction step that runs after any existing model produces an answer.

Core claim

By breaking the verification task into six sequential steps that begin with an initial response and end with a revised response, the method elicits coordinate generation and region descriptions from the same model and uses those descriptions to detect and remove claims of nonexistent objects.

What carries the argument

The six-step Region-aware Chain-of-Verification process, which chains initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation, using the model's self-produced regions as the verification signal.
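
Read as code, the chain is a single pass of six prompts through one model. A minimal sketch, assuming a generic `lvlm(image, prompt)` text-in, text-out interface; the prompt wording and the `r_cov` helper are illustrative stand-ins, not the paper's exact templates.

```python
# Minimal sketch of the six-step chain. `lvlm(image, prompt)` is a
# hypothetical text-in, text-out interface to any LVLM; the prompt
# wording is illustrative, not the paper's exact templates.

def r_cov(lvlm, image, question):
    # Step 1: initial response generation.
    initial = lvlm(image, question)

    # Step 2: entity extraction. List the objects the answer claims exist.
    entities = lvlm(
        image,
        f"List every object mentioned in: {initial}. One object per line.",
    ).splitlines()

    rejected = []
    for obj in entities:
        # Step 3: coordinate generation. The same model localizes the object.
        box = lvlm(image, f"Give the bounding box of the {obj} as x1,y1,x2,y2.")

        # Step 4: region description. Describe only the proposed region.
        desc = lvlm(image, f"Describe what is inside the region {box}.")

        # Step 5: verification execution. Check the claim against the region.
        verdict = lvlm(
            image,
            f"Statement: the image contains a {obj}. "
            f"Region description: {desc}. "
            "Is the statement supported? Answer Yes or No.",
        )
        if verdict.strip().lower().startswith("no"):
            rejected.append(obj)

    # Step 6: final response generation. Revise the answer, dropping
    # objects whose region descriptions failed verification.
    return lvlm(
        image,
        f"Question: {question}\nDraft answer: {initial}\n"
        f"Remove these unsupported objects: {rejected}. "
        "Output only the corrected answer.",
    )
```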

If this is right

  • The method applies to many existing vision-language models without any weight updates.
  • It requires no separate object-detection model or other external components.
  • Performance gains appear on multiple public benchmarks that measure object hallucinations.
  • The same prompting structure can be reused across different model sizes and architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that current models already encode usable location information that targeted prompts can surface, rather than lacking it altogether.
  • Similar self-chaining steps could be tested on other hallucination types such as incorrect attributes or relations.
  • The approach might scale to video or multi-image inputs if the coordinate and region steps are adapted accordingly.

Load-bearing premise

The model must produce accurate coordinates and region descriptions in the intermediate steps without creating new errors that invalidate the later verification.

What would settle it

Apply the full chain to a model on a standard hallucination benchmark and measure whether the rate of invented objects in the final answer drops relative to the initial answer, or instead stays the same or increases.
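
A minimal sketch of that measurement, assuming a benchmark that supplies ground-truth object sets per image and a hypothetical `extract_objects` helper; neither detail comes from the paper.

```python
# CHAIR-style invented-object rate, computed before and after the chain.
# `extract_objects` is a hypothetical helper mapping an answer string to
# the set of object names it mentions.

def hallucination_rate(answers, ground_truth_objects, extract_objects):
    """Fraction of mentioned objects absent from the annotations."""
    invented = total = 0
    for answer, gt in zip(answers, ground_truth_objects):
        mentioned = extract_objects(answer)
        total += len(mentioned)
        invented += sum(obj not in gt for obj in mentioned)
    return invented / max(total, 1)

# The method passes the test only if the final-answer rate drops:
# hallucination_rate(final, gt, extract_objects)
#     < hallucination_rate(initial, gt, extract_objects)
```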

Figures

Figures reproduced from arXiv: 2604.20696 by Alessio Tonioni, Bernt Schiele, Federico Tombari, Jiahao Xie, Nathalie Rauschmayr.

Figure 1
Figure 1: (a) An example of object hallucinations in LVLMs with the hallucinated object highlighted in red. (b) By eliciting region-level processing from LVLMs and using it as a chaining cue, we can detect and alleviate their own object hallucinations.
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Trade-off between performance and test-time com￾pute. The vanilla model is LLaVA-v1.5-7B. We report both accu￾racy and F1 score averaged across the three splits of POPE. The computational cost is represented as the time (seconds) per query, measured on a single 48GB NVIDIA L40S GPU. R-CoV is more efficient than LogicCheckGPT, always achieving the same perfor￾mance with less computational cost. and 67.68% F… view at source ↗
Figure 4
Figure 4. Figure 4: Example results of R-CoV for LLaVA-1.5. The hallucinated objects are highlighted in red. We sample three region descriptions per examinee to produce a set of diverse answers for verification. More examples are provided in the supplementary material. stead outputs regions corresponding to “hand” and “skier”, respectively, which commonly co-occur in similar visual contexts. When prompted with the image conta… view at source ↗
Figure 5
Figure 5. Figure 5: Example results of R-CoV for yes-or-no questions. The vanilla LVLM is LLaVA-v1.5-7B. The existent objects are highlighted in blue, while the hallucinated objects are highlighted in red. We sample three region descriptions per examinee to produce a set of diverse answers for verification. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example results of R-CoV for open-ended questions. The vanilla LVLM is LLaVA-v1.5-7B. The hallucinated objects are highlighted in red. We sample three region descriptions per examinee to produce a set of diverse answers for verification. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
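
Several captions above note that three region descriptions are sampled per examined object to form a set of diverse answers for verification. A minimal sketch of that sampled check, assuming the same hypothetical `lvlm` interface extended with a sampling temperature; the majority-vote rule is an illustrative assumption, not the paper's stated aggregation.

```python
# Sampled verification for one candidate object. Three region descriptions
# are drawn (per the figure captions) and aggregated by majority vote; the
# temperature value and vote rule are illustrative assumptions.

def verify_by_sampling(lvlm, image, obj, box, n_samples=3, temperature=0.7):
    votes = 0
    for _ in range(n_samples):
        # Draw a fresh description of the proposed region.
        desc = lvlm(
            image,
            f"Describe what is inside the region {box}.",
            temperature=temperature,
        )
        # Ask whether this description supports the object's existence.
        verdict = lvlm(
            image,
            f"Region description: {desc}. "
            f"Does it confirm a {obj} is present? Answer Yes or No.",
        )
        votes += verdict.strip().lower().startswith("yes")
    # Keep the object only if a majority of sampled descriptions agree.
    return votes * 2 > n_samples
```
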
Original abstract

Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Region-aware Chain-of-Verification (R-CoV), a training-free post-hoc method for alleviating object hallucinations in LVLMs. It consists of a six-step pipeline (initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation) that elicits region-level processing from the LVLM itself to serve as a chaining cue for self-verification, without requiring external detectors or fine-tuning. The central claim is that this approach can be seamlessly integrated into various LVLMs and yields significant reductions in object hallucinations, as demonstrated through experiments on multiple standard benchmarks across several models.

Significance. If the results hold after addressing the robustness concerns, R-CoV would provide a practical, lightweight way to improve LVLM reliability by leveraging the model's internal region cues rather than external tools or retraining. This aligns with the trend toward inference-time interventions and could see adoption in deployed multimodal systems. The training-free nature and lack of reliance on detection models are clear strengths, as is the evaluation across multiple LVLMs and benchmarks.

major comments (2)
  1. [Method: six-step pipeline, particularly coordinate generation and region description] The verification step operates on regions defined by coordinates and descriptions generated by the same LVLM. LVLMs have well-documented weaknesses in precise spatial localization and grounding. If the generated coordinates have low overlap with actual objects, verification occurs on irrelevant patches, which directly risks failing to correct (or even exacerbating) hallucinations. This assumption is load-bearing for the claim of effective alleviation in a training-free manner without external detectors; it requires explicit validation such as reporting localization metrics (e.g., IoU with ground-truth boxes) or ablation studies isolating the impact of coordinate accuracy.
  2. [Experiments: results section] The abstract asserts that R-CoV 'significantly alleviate[s] object hallucinations,' yet the magnitude, statistical significance, and controls (e.g., for prompt sensitivity across the multi-step pipeline) are not detailed here. Without per-benchmark tables showing effect sizes, variance, and comparisons to strong baselines that also use multi-prompt strategies, it is difficult to assess whether improvements exceed what could be achieved by simpler prompting variants.
minor comments (2)
  1. [Abstract / Method] A pipeline diagram would greatly improve clarity for the six-step process described in the abstract and method.
  2. [Project page / Experiments] Ensure the GitHub repository includes all prompts used in each step and exact evaluation scripts for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Where the concerns are valid, we outline specific revisions to strengthen the paper while preserving its core contributions.

Point-by-point responses
  1. Referee: [Method: six-step pipeline, particularly coordinate generation and region description] The verification step operates on regions defined by coordinates and descriptions generated by the same LVLM. LVLMs have well-documented weaknesses in precise spatial localization and grounding. If the generated coordinates have low overlap with actual objects, verification occurs on irrelevant patches, which directly risks failing to correct (or even exacerbating) hallucinations. This assumption is load-bearing for the claim of effective alleviation in a training-free manner without external detectors; it requires explicit validation such as reporting localization metrics (e.g., IoU with ground-truth boxes) or ablation studies isolating the impact of coordinate accuracy.

    Authors: We acknowledge the referee's valid point that coordinate accuracy is a critical assumption. LVLMs do have known limitations in precise grounding, and inaccurate coordinates could in principle direct verification to irrelevant regions. However, our design intentionally uses the LVLM's own outputs for consistency: the same model generates both the initial response and the region cues, allowing it to verify against its internal representation rather than external ground truth. This self-referential chaining is the core of the training-free approach. To directly address the concern, we will add (1) an ablation study that isolates the coordinate generation step (comparing full R-CoV against a variant using only entity extraction and description without coordinates) and (2) localization quality metrics (average IoU with ground-truth boxes) on any benchmarks that provide bounding-box annotations. These additions will quantify the robustness of the assumption. revision: partial

  2. Referee: [Experiments: results section] The abstract asserts that R-CoV 'significantly alleviate[s] object hallucinations,' yet the magnitude, statistical significance, and controls (e.g., for prompt sensitivity across the multi-step pipeline) are not detailed here. Without per-benchmark tables showing effect sizes, variance, and comparisons to strong baselines that also use multi-prompt strategies, it is difficult to assess whether improvements exceed what could be achieved by simpler prompting variants.

    Authors: We agree that the current experimental presentation would benefit from greater granularity to support the 'significant alleviation' claim. In the revised manuscript we will expand the results section with: per-benchmark tables reporting absolute and relative reductions in hallucination rates, standard deviations across multiple independent runs (to quantify variance and prompt sensitivity), and paired statistical significance tests. We will also add direct comparisons to strong multi-prompt baselines (e.g., self-consistency, iterative Chain-of-Thought, and self-verification variants) that use comparable numbers of forward passes, thereby demonstrating that gains are attributable to the region-aware chaining rather than prompt length alone. revision: yes
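
The paired tests promised here are cheap to run once per-example hallucination indicators exist for both passes. A minimal sketch using a paired sign-flip permutation test; the 0/1 indicator arrays and the one-sided alternative are assumptions for illustration, not the authors' stated protocol.

```python
import random

# Paired sign-flip permutation test on per-example 0/1 hallucination
# indicators for the initial (`before`) and final (`after`) answers on
# the same inputs. One-sided alternative: the chain reduces hallucinations.

def paired_permutation_pvalue(before, after, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [b - a for b, a in zip(before, after)]  # positive = improvement
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is symmetric around zero.
        flipped = sum(d * rng.choice((-1, 1)) for d in diffs) / len(diffs)
        hits += flipped >= observed
    return hits / n_resamples
```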

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper presents a purely procedural, training-free method consisting of six explicit steps (initial response, entity extraction, coordinate generation, region description, verification, final response) that elicits region-level processing from the LVLM to verify its own outputs. No mathematical equations, fitted parameters, or self-citations are used to derive the central claim; effectiveness is shown via empirical evaluation on external hallucination benchmarks across multiple LVLMs. The result does not reduce to its inputs by construction, and the method remains self-contained against external benchmarks without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that intermediate region descriptions produced by the same model are sufficiently accurate to serve as verification signals. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption LVLMs can produce usable bounding-box coordinates and region descriptions when explicitly prompted in the intermediate steps.
    Invoked in the coordinate generation and region description steps of the six-step pipeline.
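
One direct way to audit this axiom, echoing the referee's request, is to score the intermediate boxes against annotations where they exist. A minimal sketch, assuming (x1, y1, x2, y2) boxes and a given pairing of predictions to ground truth, which sidesteps the matching problem.

```python
# Mean intersection-over-union between model-generated boxes and
# ground-truth boxes. Boxes are (x1, y1, x2, y2); the prediction-to-truth
# pairing is assumed to be given, which is a simplification.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (
        (a[2] - a[0]) * (a[3] - a[1])
        + (b[2] - b[0]) * (b[3] - b[1])
        - inter
    )
    return inter / union if union else 0.0

def mean_iou(predicted, ground_truth):
    pairs = list(zip(predicted, ground_truth))
    return sum(iou(p, g) for p, g in pairs) / max(len(pairs), 1)
```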

pith-pipeline@v0.9.0 · 5535 in / 1142 out tokens · 28675 ms · 2026-05-10T01:24:43.969383+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 18 canonical work pages · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 2

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 4

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 1, 2

  5. [5]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In ECCV, 2024. 1

  6. [6]

    Unified hallucination detection for multimodal large language models

    Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multimodal large language models. In ACL, 2024. 2

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.

  8. [8]

    Halc: Object hallucination reduction via adaptive focal-contrast decoding

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. In ICML, 2024. 2

  9. [9]

    Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023. 2

  10. [10]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023. 1, 2

  11. [11]

    Chain-of-verification reduces hallucination in large language models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. In ACL Findings, 2024. 1

  12. [12]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  13. [13]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 4

  14. [14]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024. 2

  15. [15]

    Detecting and preventing hallucinations in large vision language models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In AAAI, 2024. 1, 2

  16. [16]

    Opera: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation. In CVPR, 2024. 1, 2, 8

  17. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024. 2

  18. [18]

    Faithscore: Fine-grained evaluations of hallucinations in large vision-language models

    Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. Faithscore: Fine-grained evaluations of hallucinations in large vision-language models. In EMNLP Findings, 2024. 2

  19. [19]

    Brave: Broadening the visual encoding of vision-language models

    Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. In ECCV, 2024. 1, 2

  20. [20]

    Geochat: Grounded large vision-language model for remote sensing

    Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. In CVPR, 2024. 1

  21. [21]

    Volcano: mitigating multimodal hallucination through self-feedback guided revision

    Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. Volcano: mitigating multimodal hallucination through self-feedback guided revision. In NAACL, 2024. 1, 2

  22. [22]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024. 1, 2, 8

  23. [23]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 1, 2

  24. [24]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.

  25. [25]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 2, 4

  26. [26]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4

  27. [27]

    Mitigating hallucination in large multi-modal models via robust instruction tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In ICLR, 2024.

  28. [28]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 4

  29. [29]

    Paying more attention to image: A training-free method for alleviating hallucination in lvlms

    Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In ECCV, 2024. 1, 2, 8

  30. [30]

    Negative object presence evaluation (nope) to measure object hallucination in vision-language models

    Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative object presence evaluation (nope) to measure object hallucination in vision-language models. In ACLW, 2024. 2

  31. [31]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

  32. [32]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In EMNLP, 2023. 1, 2, 4

  33. [33]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. 1, 2

  34. [34]

    Clipttt: Clip-guided test-time training helps lvlms see better

    Mriganka Nath, Anurag Das, Jiahao Xie, and Bernt Schiele. Clipttt: Clip-guided test-time training helps lvlms see better. arXiv preprint arXiv:2603.26486, 2026. 2

  35. [35]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 2

  36. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 1, 2

  37. [37]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 4

  38. [38]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In EMNLP, 2018. 2, 4, 8

  39. [39]

    Visual chain of thought: bridging logical gaps with multimodal infillings

    Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, and William Yang Wang. Visual chain of thought: bridging logical gaps with multimodal infillings. arXiv preprint arXiv:2305.02317, 2023. 2

  40. [40]

    Eagle: Exploring the design space for multimodal llms with mixture of encoders

    Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders. In ICLR, 2025. 1, 2

  41. [41]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL Findings, 2024. 2

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1, 2

  43. [43]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. 1, 2

  44. [44]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022. 1

  45. [45]

    Vigc: Visual instruction generation and correction

    Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. Vigc: Visual instruction generation and correction. In AAAI, 2024. 1, 2

  46. [46]

    Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 2

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 1, 2

  48. [48]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 4

  49. [49]

    Logical closed loop: Uncovering object hallucinations in large vision-language models

    Junfei Wu, Qiang Liu, Ding Wang, Jinghao Zhang, Shu Wu, Liang Wang, and Tieniu Tan. Logical closed loop: Uncovering object hallucinations in large vision-language models. In ACL, 2024. 1, 2, 3, 4, 5, 6

  50. [50]

    Combating multimodal llm hallucination via bottom-up holistic reasoning

    Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, and Tat-Seng Chua. Combating multimodal llm hallucination via bottom-up holistic reasoning. In AAAI, 2025. 1

  51. [51]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. TPAMI, 2024. 2

  52. [52]

    Fiha: Automated fine-grained hallucinations evaluations in large vision language models with davidson scene graphs

    Bowen Yan, Zhengsong Zhang, Liqiang Jing, Eftekhar Hossain, and Xinya Du. Fiha: Automated fine-grained hallucinations evaluations in large vision language models with davidson scene graphs. In ACL Findings, 2025. 2

  53. [53]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1, 2, 4

  54. [54]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. SCIS, 2024. 1, 2, 4

  55. [55]

    Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness. In CVPR, 2025. 2

  56. [56]

    Multimodal chain-of-thought reasoning in language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. TMLR, 2024. 2

  57. [57]

    Aligning modalities in vision large language models via preference fine-tuning

    Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024. 2

  58. [58]

    Analyzing and mitigating object hallucination in large vision-language models

    Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. In ICLR, 2024. 1, 2, 4

  59. [59]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2, 4


    Note the dining table is equivalent to the table. Output only the corrected passage, without introducing extra contents. Here are examples: {In-context examples} Now complete the following: [Query] {Input query} [Passage] {Input passage} [Supplementary Information] {Input information} [Response] 14 Table 12.Prompt template for GPT-4o assisted evaluation.{...