pith. machine review for the scientific record.

arxiv: 2604.20696 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords object hallucinations · large vision-language models · chain-of-verification · region-aware processing · training-free methods · multimodal reasoning · hallucination mitigation

The pith

Vision-language models can reduce their own object hallucinations by verifying descriptions of specific image regions they generate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to prompt large vision-language models to extract objects from their initial answers, generate coordinates for those objects in the image, describe the regions, and then verify consistency before producing a final answer. This chain uses the model's own outputs as the cue for checking, rather than external tools or retraining. A reader would care because models often claim objects exist in images when they do not, which breaks trust in tasks such as captioning or visual question answering. If the approach works, it turns the model's tendency to focus on regions into a built-in error-correction step that runs after any existing model produces an answer.

Core claim

By breaking the verification task into six sequential steps that begin with an initial response and end with a revised response, the method elicits coordinate generation and region descriptions from the same model and uses those descriptions to detect and remove claims of nonexistent objects.

What carries the argument

The six-step Region-aware Chain-of-Verification process, which chains initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation, using the model's self-produced regions as the verification signal.
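
Read as code, the chain is a single pass of six prompts through one model. A minimal sketch, assuming a generic `lvlm(image, prompt)` text-in, text-out interface; the prompt wording and the `r_cov` helper are illustrative stand-ins, not the paper's exact templates.

```python
# Minimal sketch of the six-step chain. `lvlm(image, prompt)` is a
# hypothetical text-in, text-out interface to any LVLM; the prompt
# wording is illustrative, not the paper's exact templates.

def r_cov(lvlm, image, question):
    # Step 1: initial response generation.
    initial = lvlm(image, question)

    # Step 2: entity extraction. List the objects the answer claims exist.
    entities = lvlm(
        image,
        f"List every object mentioned in: {initial}. One object per line.",
    ).splitlines()

    rejected = []
    for obj in entities:
        # Step 3: coordinate generation. The same model localizes the object.
        box = lvlm(image, f"Give the bounding box of the {obj} as x1,y1,x2,y2.")

        # Step 4: region description. Describe only the proposed region.
        desc = lvlm(image, f"Describe what is inside the region {box}.")

        # Step 5: verification execution. Check the claim against the region.
        verdict = lvlm(
            image,
            f"Statement: the image contains a {obj}. "
            f"Region description: {desc}. "
            "Is the statement supported? Answer Yes or No.",
        )
        if verdict.strip().lower().startswith("no"):
            rejected.append(obj)

    # Step 6: final response generation. Revise the answer, dropping
    # objects whose region descriptions failed verification.
    return lvlm(
        image,
        f"Question: {question}\nDraft answer: {initial}\n"
        f"Remove these unsupported objects: {rejected}. "
        "Output only the corrected answer.",
    )
```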

If this is right

  • The method applies to many existing vision-language models without any weight updates.
  • It requires no separate object-detection model or other external components.
  • Performance gains appear on multiple public benchmarks that measure object hallucinations.
  • The same prompting structure can be reused across different model sizes and architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that current models already encode usable location information that targeted prompts can surface, rather than lacking it altogether.
  • Similar self-chaining steps could be tested on other hallucination types such as incorrect attributes or relations.
  • The approach might scale to video or multi-image inputs if the coordinate and region steps are adapted accordingly.

Load-bearing premise

The model must produce accurate coordinates and region descriptions in the intermediate steps without creating new errors that invalidate the later verification.

What would settle it

Apply the full chain to a model on a standard hallucination benchmark and measure whether the rate of invented objects in the final answer drops relative to the initial answer, or instead stays the same or increases.
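
A minimal sketch of that measurement, assuming a benchmark that supplies ground-truth object sets per image and a hypothetical `extract_objects` helper; neither detail comes from the paper.

```python
# CHAIR-style invented-object rate, computed before and after the chain.
# `extract_objects` is a hypothetical helper mapping an answer string to
# the set of object names it mentions.

def hallucination_rate(answers, ground_truth_objects, extract_objects):
    """Fraction of mentioned objects absent from the annotations."""
    invented = total = 0
    for answer, gt in zip(answers, ground_truth_objects):
        mentioned = extract_objects(answer)
        total += len(mentioned)
        invented += sum(obj not in gt for obj in mentioned)
    return invented / max(total, 1)

# The method passes the test only if the final-answer rate drops:
# hallucination_rate(final, gt, extract_objects)
#     < hallucination_rate(initial, gt, extract_objects)
```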

Figures

Figures reproduced from arXiv: 2604.20696 by Alessio Tonioni, Bernt Schiele, Federico Tombari, Jiahao Xie, Nathalie Rauschmayr.

Figure 1
Figure 1: (a) An example of object hallucinations in LVLMs with the hallucinated object highlighted in red. (b) By eliciting region-level processing from LVLMs and using it as a chaining cue, we can detect and alleviate their own object hallucinations.
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Trade-off between performance and test-time com￾pute. The vanilla model is LLaVA-v1.5-7B. We report both accu￾racy and F1 score averaged across the three splits of POPE. The computational cost is represented as the time (seconds) per query, measured on a single 48GB NVIDIA L40S GPU. R-CoV is more efficient than LogicCheckGPT, always achieving the same perfor￾mance with less computational cost. and 67.68% F… view at source ↗
Figure 4
Figure 4. Figure 4: Example results of R-CoV for LLaVA-1.5. The hallucinated objects are highlighted in red. We sample three region descriptions per examinee to produce a set of diverse answers for verification. More examples are provided in the supplementary material. stead outputs regions corresponding to “hand” and “skier”, respectively, which commonly co-occur in similar visual contexts. When prompted with the image conta… view at source ↗
Figure 5
Figure 5. Figure 5: Example results of R-CoV for yes-or-no questions. The vanilla LVLM is LLaVA-v1.5-7B. The existent objects are highlighted in blue, while the hallucinated objects are highlighted in red. We sample three region descriptions per examinee to produce a set of diverse answers for verification. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example results of R-CoV for open-ended questions. The vanilla LVLM is LLaVA-v1.5-7B. The hallucinated objects are highlighted in red. We sample three region descriptions per examinee to produce a set of diverse answers for verification. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
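
Several captions above note that three region descriptions are sampled per examined object to form a set of diverse answers for verification. A minimal sketch of that sampled check, assuming the same hypothetical `lvlm` interface extended with a sampling temperature; the majority-vote rule is an illustrative assumption, not the paper's stated aggregation.

```python
# Sampled verification for one candidate object. Three region descriptions
# are drawn (per the figure captions) and aggregated by majority vote; the
# temperature value and vote rule are illustrative assumptions.

def verify_by_sampling(lvlm, image, obj, box, n_samples=3, temperature=0.7):
    votes = 0
    for _ in range(n_samples):
        # Draw a fresh description of the proposed region.
        desc = lvlm(
            image,
            f"Describe what is inside the region {box}.",
            temperature=temperature,
        )
        # Ask whether this description supports the object's existence.
        verdict = lvlm(
            image,
            f"Region description: {desc}. "
            f"Does it confirm a {obj} is present? Answer Yes or No.",
        )
        votes += verdict.strip().lower().startswith("yes")
    # Keep the object only if a majority of sampled descriptions agree.
    return votes * 2 > n_samples
```
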
Original abstract

Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Region-aware Chain-of-Verification (R-CoV), a training-free post-hoc method for alleviating object hallucinations in LVLMs. It consists of a six-step pipeline (initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation) that elicits region-level processing from the LVLM itself to serve as a chaining cue for self-verification, without requiring external detectors or fine-tuning. The central claim is that this approach can be seamlessly integrated into various LVLMs and yields significant reductions in object hallucinations, as demonstrated through experiments on multiple standard benchmarks across several models.

Significance. If the results hold after addressing the robustness concerns, R-CoV would provide a practical, lightweight way to improve LVLM reliability by leveraging the model's internal region cues rather than external tools or retraining. This aligns with the trend toward inference-time interventions and could see adoption in deployed multimodal systems. The training-free nature and lack of reliance on detection models are clear strengths, as is the evaluation across multiple LVLMs and benchmarks.

major comments (2)
  1. [Method: six-step pipeline, particularly coordinate generation and region description] The verification step operates on regions defined by coordinates and descriptions generated by the same LVLM. LVLMs have well-documented weaknesses in precise spatial localization and grounding. If the generated coordinates have low overlap with actual objects, verification occurs on irrelevant patches, which directly risks failing to correct (or even exacerbating) hallucinations. This assumption is load-bearing for the claim of effective alleviation in a training-free manner without external detectors; it requires explicit validation such as reporting localization metrics (e.g., IoU with ground-truth boxes) or ablation studies isolating the impact of coordinate accuracy.
  2. [Experiments: results section] The abstract asserts that R-CoV 'significantly alleviate[s] object hallucinations,' yet the magnitude, statistical significance, and controls (e.g., for prompt sensitivity across the multi-step pipeline) are not detailed here. Without per-benchmark tables showing effect sizes, variance, and comparisons to strong baselines that also use multi-prompt strategies, it is difficult to assess whether improvements exceed what could be achieved by simpler prompting variants.
minor comments (2)
  1. [Abstract / Method] A pipeline diagram would greatly improve clarity for the six-step process described in the abstract and method.
  2. [Project page / Experiments] Ensure the GitHub repository includes all prompts used in each step and exact evaluation scripts for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Where the concerns are valid, we outline specific revisions to strengthen the paper while preserving its core contributions.

Point-by-point responses
  1. Referee: [Method: six-step pipeline, particularly coordinate generation and region description] The verification step operates on regions defined by coordinates and descriptions generated by the same LVLM. LVLMs have well-documented weaknesses in precise spatial localization and grounding. If the generated coordinates have low overlap with actual objects, verification occurs on irrelevant patches, which directly risks failing to correct (or even exacerbating) hallucinations. This assumption is load-bearing for the claim of effective alleviation in a training-free manner without external detectors; it requires explicit validation such as reporting localization metrics (e.g., IoU with ground-truth boxes) or ablation studies isolating the impact of coordinate accuracy.

    Authors: We acknowledge the referee's valid point that coordinate accuracy is a critical assumption. LVLMs do have known limitations in precise grounding, and inaccurate coordinates could in principle direct verification to irrelevant regions. However, our design intentionally uses the LVLM's own outputs for consistency: the same model generates both the initial response and the region cues, allowing it to verify against its internal representation rather than external ground truth. This self-referential chaining is the core of the training-free approach. To directly address the concern, we will add (1) an ablation study that isolates the coordinate generation step (comparing full R-CoV against a variant using only entity extraction and description without coordinates) and (2) localization quality metrics (average IoU with ground-truth boxes) on any benchmarks that provide bounding-box annotations. These additions will quantify the robustness of the assumption. revision: partial

  2. Referee: [Experiments: results section] The abstract asserts that R-CoV 'significantly alleviate[s] object hallucinations,' yet the magnitude, statistical significance, and controls (e.g., for prompt sensitivity across the multi-step pipeline) are not detailed here. Without per-benchmark tables showing effect sizes, variance, and comparisons to strong baselines that also use multi-prompt strategies, it is difficult to assess whether improvements exceed what could be achieved by simpler prompting variants.

    Authors: We agree that the current experimental presentation would benefit from greater granularity to support the 'significant alleviation' claim. In the revised manuscript we will expand the results section with: per-benchmark tables reporting absolute and relative reductions in hallucination rates, standard deviations across multiple independent runs (to quantify variance and prompt sensitivity), and paired statistical significance tests. We will also add direct comparisons to strong multi-prompt baselines (e.g., self-consistency, iterative Chain-of-Thought, and self-verification variants) that use comparable numbers of forward passes, thereby demonstrating that gains are attributable to the region-aware chaining rather than prompt length alone. revision: yes
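
The paired tests promised here are cheap to run once per-example hallucination indicators exist for both passes. A minimal sketch using a paired sign-flip permutation test; the 0/1 indicator arrays and the one-sided alternative are assumptions for illustration, not the authors' stated protocol.

```python
import random

# Paired sign-flip permutation test on per-example 0/1 hallucination
# indicators for the initial (`before`) and final (`after`) answers on
# the same inputs. One-sided alternative: the chain reduces hallucinations.

def paired_permutation_pvalue(before, after, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [b - a for b, a in zip(before, after)]  # positive = improvement
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is symmetric around zero.
        flipped = sum(d * rng.choice((-1, 1)) for d in diffs) / len(diffs)
        hits += flipped >= observed
    return hits / n_resamples
```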

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper presents a purely procedural, training-free method consisting of six explicit steps (initial response, entity extraction, coordinate generation, region description, verification, final response) that elicits region-level processing from the LVLM to verify its own outputs. No mathematical equations, fitted parameters, or self-citations are used to derive the central claim; effectiveness is shown via empirical evaluation on external hallucination benchmarks across multiple LVLMs. The result does not reduce to its inputs by construction, and the method remains self-contained against external benchmarks without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that intermediate region descriptions produced by the same model are sufficiently accurate to serve as verification signals. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption LVLMs can produce usable bounding-box coordinates and region descriptions when explicitly prompted in the intermediate steps.
    Invoked in the coordinate generation and region description steps of the six-step pipeline.
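
One direct way to audit this axiom, echoing the referee's request, is to score the intermediate boxes against annotations where they exist. A minimal sketch, assuming (x1, y1, x2, y2) boxes and a given pairing of predictions to ground truth, which sidesteps the matching problem.

```python
# Mean intersection-over-union between model-generated boxes and
# ground-truth boxes. Boxes are (x1, y1, x2, y2); the prediction-to-truth
# pairing is assumed to be given, which is a simplification.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (
        (a[2] - a[0]) * (a[3] - a[1])
        + (b[2] - b[0]) * (b[3] - b[1])
        - inter
    )
    return inter / union if union else 0.0

def mean_iou(predicted, ground_truth):
    pairs = list(zip(predicted, ground_truth))
    return sum(iou(p, g) for p, g in pairs) / max(len(pairs), 1)
```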

pith-pipeline@v0.9.0 · 5535 in / 1142 out tokens · 28675 ms · 2026-05-10T01:24:43.969383+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 18 canonical work pages · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 2

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 4

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 1, 2

  5. [5]

    Sharegpt4v: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In ECCV, 2024. 1

  6. [6]

    Unified hallucination detection for multimodal large language models

    Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multimodal large language models. In ACL, 2024. 2

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.

  8. [8]

    Halc: Object hallucination reduction via adaptive focal-contrast decoding

    Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. In ICML, 2024. 2

  9. [9]

    Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023. 2

  10. [10]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023. 1, 2

  11. [11]

    Chain-of-verification reduces hallucination in large language models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. In ACL Findings, 2024. 1

  12. [12]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  13. [13]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 4

  14. [14]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024. 2

  15. [15]

    Detecting and preventing hallucinations in large vision language models

    Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In AAAI, 2024. 1, 2

  16. [16]

    Opera: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation. In CVPR, 2024. 1, 2, 8

  17. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024. 2

  18. [18]

    Faithscore: Fine-grained evaluations of hallucinations in large vision-language models

    Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. Faithscore: Fine-grained evaluations of hallucinations in large vision-language models. In EMNLP Findings, 2024. 2

  19. [19]

    Brave: Broadening the visual encoding of vision-language models

    Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave: Broadening the visual encoding of vision-language models. In ECCV, 2024. 1, 2

  20. [20]

    Geochat: Grounded large vision-language model for remote sensing

    Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. In CVPR, 2024. 1

  21. [21]

    Volcano: mitigating multimodal hallucination through self-feedback guided revision

    Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. Volcano: mitigating multimodal hallucination through self-feedback guided revision. In NAACL, 2024. 1, 2

  22. [22]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024. 1, 2, 8

  23. [23]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 1, 2

  24. [24]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.

  25. [25]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 2, 4

  26. [26]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 4

  27. [27]

    Mitigating hallucination in large multi-modal models via robust instruction tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In ICLR, 2024.

  28. [28]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 4

  29. [29]

    Paying more attention to image: A training-free method for alleviating hallucination in lvlms

    Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In ECCV, 2024. 1, 2, 8

  30. [30]

    Negative object presence evaluation (nope) to measure object hallucination in vision-language models

    Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative object presence evaluation (nope) to measure object hallucination in vision-language models. In ACLW, 2024. 2

  31. [31]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

  32. [32]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In EMNLP, 2023. 1, 2, 4

  33. [33]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. 1, 2

  34. [34]

    Clipttt: Clip-guided test-time training helps lvlms see better

    Mriganka Nath, Anurag Das, Jiahao Xie, and Bernt Schiele. Clipttt: Clip-guided test-time training helps lvlms see better. arXiv preprint arXiv:2603.26486, 2026. 2

  35. [35]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 2

  36. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 1, 2

  37. [37]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 4

  38. [38]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In EMNLP, 2018. 2, 4, 8

  39. [39]

    Visual chain of thought: bridging logical gaps with multimodal infillings

    Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, and William Yang Wang. Visual chain of thought: bridging logical gaps with multimodal infillings. arXiv preprint arXiv:2305.02317, 2023. 2

  40. [40]

    Eagle: Exploring the design space for multimodal llms with mixture of encoders

    Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders. In ICLR, 2025. 1, 2

  41. [41]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL Findings, 2024. 2

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1, 2

  43. [43]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. 1, 2

  44. [44]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022. 1

  45. [45]

    Vigc: Visual instruction generation and correction

    Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. Vigc: Visual instruction generation and correction. In AAAI, 2024. 1, 2

  46. [46]

    Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. 2

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 1, 2

  48. [48]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 4

  49. [49]

    Logical closed loop: Uncovering object hallucinations in large vision-language models

    Junfei Wu, Qiang Liu, Ding Wang, Jinghao Zhang, Shu Wu, Liang Wang, and Tieniu Tan. Logical closed loop: Uncovering object hallucinations in large vision-language models. In ACL, 2024. 1, 2, 3, 4, 5, 6

  50. [50]

    Combating multimodal llm hallucination via bottom-up holistic reasoning

    Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, and Tat-Seng Chua. Combating multimodal llm hallucination via bottom-up holistic reasoning. In AAAI, 2025. 1

  51. [51]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. TPAMI, 2024. 2

  52. [52]

    Fiha: Automated fine-grained hallucinations evaluations in large vision language models with davidson scene graphs

    Bowen Yan, Zhengsong Zhang, Liqiang Jing, Eftekhar Hossain, and Xinya Du. Fiha: Automated fine-grained hallucinations evaluations in large vision language models with davidson scene graphs. In ACL Findings, 2025. 2

  53. [53]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1, 2, 4

  54. [54]

    Woodpecker: Hallucination correction for multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. SCIS, 2024. 1, 2, 4

  55. [55]

    Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness

    Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness. In CVPR, 2025. 2

  56. [56]

    Multimodal chain-of-thought reasoning in language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. TMLR, 2024. 2

  57. [57]

    Aligning modalities in vision large language models via preference fine-tuning

    Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024. 2

  58. [58]

    Analyzing and mitigating object hallucination in large vision-language models

    Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. In ICLR, 2024. 1, 2, 4

  59. [59]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2, 4


    Note the dining table is equivalent to the table. Output only the corrected passage, without introducing extra contents. Here are examples: {In-context examples} Now complete the following: [Query] {Input query} [Passage] {Input passage} [Supplementary Information] {Input information} [Response] 14 Table 12.Prompt template for GPT-4o assisted evaluation.{...