Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

Chong Feng; Feng Zhang; Jianfei Zhao; Xin Sun; Zhixing Tan

arXiv: 2511.20032 · v3 · submitted 2025-11-25 · cs.CV

Pith reviewed 2026-05-17 04:55 UTC · model grok-4.3

keywords: multimodal large language models · hallucination mitigation · vision-guided attention · visual grounding · training-free method · attention steering · image captioning

The pith

Vision-Guided Attention reduces hallucinations in MLLMs by building precise visual grounding from token semantics and steering focus to relevant image regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that MLLMs already pull accurate semantics from visual tokens yet still produce hallucinations because their attention does not localize well enough during inference. VGA fixes this by first turning those semantics into explicit grounding maps and then using the maps to redirect attention without any model retraining. The approach adds almost no extra cost because each token runs through only one forward pass and works with existing fast attention kernels. For captioning tasks the grounding updates on the fly to ignore regions already mentioned in the output. Experiments across several models and standard hallucination tests show clear gains over previous methods.

Core claim

The central claim is that explicit visual guidance derived from the semantic content already present in visual tokens can steer MLLMs toward accurate image regions and thereby reduce hallucinations. For image captioning the guidance is refined dynamically by suppressing regions that have already been described. The entire process requires only a single forward pass per token and remains fully compatible with efficient attention implementations.

What carries the argument

Vision-Guided Attention (VGA), which constructs precise visual grounding from the semantics in visual tokens and then applies that grounding to direct the model's attention during inference.
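
As a rough illustration of the steering step, the sketch below adds a scaled grounding map to one attention row over the visual-token span, following the update quoted later on this page (α̂ = α + β·G). It is a minimal sketch under assumptions, not the authors' implementation: the function name `steer_attention`, the toy numbers, the choice to operate on post-softmax weights, and the renormalization step are all hypothetical.

```python
def steer_attention(attn_row, grounding, start, end, beta):
    """Add beta * grounding to attn_row[start:end], then renormalize.

    attn_row:  attention weights of the last query over all key positions
    grounding: one score per visual token (len == end - start)
    Renormalizing to sum 1 is an assumption made for this sketch.
    """
    if len(grounding) != end - start:
        raise ValueError("grounding must have one entry per visual token")
    steered = list(attn_row)
    for offset, g in enumerate(grounding):
        steered[start + offset] += beta * g
    total = sum(steered)
    return [w / total for w in steered]


# Toy row: 2 text tokens, 3 visual tokens (positions 2..4), 1 text token.
attn_row = [0.4, 0.1, 0.1, 0.1, 0.1, 0.2]
grounding = [0.0, 1.0, 0.0]  # hypothetical map pointing at the middle patch
steered = steer_attention(attn_row, grounding, start=2, end=5, beta=0.5)
```

In this toy case the grounded visual token gains weight and the text tokens lose share. How the real method stays compatible with fused attention kernels is not described in the text here, so the sketch makes no claim about that.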

If this is right

VGA delivers state-of-the-art results on multiple hallucination benchmarks across diverse MLLMs.
The method adds negligible latency because every token requires only a single forward pass.
No training or parameter updates are needed on the underlying MLLM.
Dynamic suppression of already-described regions improves captioning accuracy.
Full compatibility with FlashAttention preserves efficient inference.
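
The dynamic suppression mentioned in the list above can be pictured as down-weighting guidance wherever an already-described object is grounded. The sketch below is purely illustrative: the multiplicative rule, the `strength` parameter, and the function name `suppress_described` are assumptions made for this example, and the paper's actual update rule is not given in the text here.

```python
def suppress_described(guidance, described, strength=1.0):
    """Scale down guidance where `described` is high (hypothetical rule).

    guidance:  current per-patch guidance scores in [0, 1]
    described: per-patch grounding of objects already mentioned, in [0, 1]
    """
    if len(guidance) != len(described):
        raise ValueError("maps must cover the same visual tokens")
    return [g * max(0.0, 1.0 - strength * d) for g, d in zip(guidance, described)]


guidance = [0.2, 0.9, 0.7, 0.1]
described = [0.0, 1.0, 0.5, 0.0]  # toy map of regions the caption already covered
updated = suppress_described(guidance, described)
```

Patches that were fully described drop to zero guidance, partially described ones are halved, and untouched regions keep their original score, so later tokens would be nudged toward parts of the image not yet mentioned.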

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same grounding construction could be tested on video or multi-image inputs where localization errors compound over time.
Current MLLM attention layers may be systematically under-using information that their own visual encoders already provide.
Combining VGA with existing training-based dehallucination methods might produce additive improvements.
One could test whether analogous guidance signals, built from token semantics in the same way, help in non-visual modalities.

Load-bearing premise

The semantic information already inside visual tokens can be turned into grounding signals the model is not already using, and directing attention with those signals will reduce hallucinations without creating new failure modes.

What would settle it

An experiment that supplies perfect visual grounding to the model yet still records the original rate of hallucinations on standard benchmarks would show the guidance step is not the decisive factor.
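
One way to frame this test is to score the same captioning pipeline twice, once with the model's own grounding and once with oracle grounding, and compare object hallucination rates. The sketch below computes a simple instance-level rate (hallucinated mentions divided by all mentions), in the spirit of CHAIR-style metrics. The function name, the toy object sets, and the two "runs" are hypothetical placeholders, not data from the paper.

```python
def hallucination_rate(mentioned_per_image, present_per_image):
    """Fraction of mentioned objects that are absent from the image."""
    total = 0
    hallucinated = 0
    for mentioned, present in zip(mentioned_per_image, present_per_image):
        mentioned = set(mentioned)
        total += len(mentioned)
        hallucinated += len(mentioned - set(present))
    return hallucinated / total if total else 0.0


# Toy annotations: objects actually present in two images.
present = [{"dog", "frisbee"}, {"cat", "sofa"}]
# Toy outputs: objects mentioned without and with oracle grounding.
baseline_mentions = [{"dog", "frisbee", "person"}, {"cat", "sofa", "table"}]
oracle_mentions = [{"dog", "frisbee"}, {"cat", "sofa", "table"}]

baseline_rate = hallucination_rate(baseline_mentions, present)
oracle_rate = hallucination_rate(oracle_mentions, present)
```

If, on a real benchmark, the oracle-guided rate stayed at the baseline level, that would indicate the guidance step is not the decisive factor.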

Figures

Figures reproduced from arXiv: 2511.20032 by Chong Feng, Feng Zhang, Jianfei Zhao, Xin Sun, Zhixing Tan.

**Figure 4.** Performance comparison of LLaVA-1.5’s response …

**Figure 5.** Comparison of visual grounding performance between …

**Figure 6.** Point-biserial correlation between visual semantic …

**Figure 7.** Accuracy on positive and negative samples with dif…

**Figure 8.** VSC’s visual grounding for absent and present objects.

**Figure 9.** Programmed Vision-Guidance. We set γ to 0.1 in this case to better illustrate the dynamic process.

**Figure 10.** The model’s attention to the BOS token in each layer.

**Figure 11.** Effectiveness of γ in MSCOCO’s Random set. … for LLaVA-1.5, layer 0 for LLaVA-Next, and layer 4 for Qwen2.5-VL. We terminate VGA application at the middle of the model: at layer 24 for LLaVA-1.5-13B and at layer 16 for all other models. We adopt the default settings of γ = 0.2 and λ = 0.02. For the larger model (LLaVA-1.5-13B) and visually simpler tasks (POPE), we increase γ to 0.25 to apply stronger vision …

Abstract

Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VGA is a lightweight training-free tweak that steers MLLM attention using semantics already in the visual tokens, plus dynamic suppression for captioning, but the real lift depends on whether that grounding actually differs from native patterns.

This paper offers a simple inference-time method, Vision-Guided Attention. It derives grounding directly from the semantic content of visual tokens, uses that grounding to guide where the model looks, and adds a step that suppresses already-described regions during caption generation. The method needs only one forward pass per token and stays compatible with FlashAttention, so the overhead stays small.

Referee Report

2 major / 2 minor

Summary. The paper proposes Vision-Guided Attention (VGA), a training-free inference-time intervention for MLLMs. It first extracts semantic content from visual tokens to build explicit visual grounding, then steers attention toward relevant image regions; in captioning it further suppresses already-described regions dynamically. Each token requires only one forward pass, the method is compatible with FlashAttention, and the authors claim SOTA dehallucination results across multiple MLLMs and benchmarks together with improved visual understanding.

Significance. If the central empirical claims hold, VGA would be a useful practical contribution: a lightweight, training-free technique that exploits already-computed visual-token representations to reduce hallucinations. The training-free nature, negligible latency, and FlashAttention compatibility are clear strengths that lower the barrier to adoption. The dynamic suppression mechanism in captioning is a reasonable extension of the core idea.

major comments (2)

[§4 and abstract] §4 (Experimental Results) and the abstract: the claim of 'state-of-the-art dehallucination performance' is presented without any quantitative numbers, ablation tables, or error analysis in the visible text. Without these data it is impossible to verify the magnitude of improvement or to check whether the method reduces hallucinations without introducing new omission or over-suppression failures.
[§3] §3 (Method): the construction of 'precise visual grounding' from visual-token semantics is described at a high level but lacks a concrete procedure or comparison showing that the resulting attention maps differ meaningfully from the model's native attention. If the grounding largely reproduces existing patterns, the intervention is redundant and the central assumption that 'the model is not already using' this information remains untested.
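
The redundancy concern in the second comment can be checked numerically by comparing the grounding map with the model's native attention over the same visual tokens. The sketch below uses Jensen-Shannon divergence (base 2, so values lie in [0, 1]) as one possible measure. The metric choice, the function names, and the toy vectors are assumptions for illustration and do not come from the paper.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical maps, 1 for disjoint ones."""
    p, q = normalize(p), normalize(q)
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


# Toy maps over four visual tokens.
native_attention = [0.25, 0.25, 0.25, 0.25]  # diffuse native pattern
grounding_map = [0.05, 0.80, 0.10, 0.05]     # sharply localized grounding
score = js_divergence(native_attention, grounding_map)
```

A score near zero across many samples would suggest the grounding largely reproduces native attention, while consistently larger values would support the claim that the guidance carries information the model is not already using.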

minor comments (2)

[Abstract] Abstract: including one or two key quantitative results (e.g., percentage reduction on a standard benchmark) would make the SOTA claim immediately verifiable.
[§3] Notation: the distinction between 'visual grounding' and 'attention guidance' should be clarified with a short equation or pseudocode snippet to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable feedback on our manuscript. We provide point-by-point responses to the major comments and outline the revisions we plan to make.

Referee: [§4 and abstract] §4 (Experimental Results) and the abstract: the claim of 'state-of-the-art dehallucination performance' is presented without any quantitative numbers, ablation tables, or error analysis in the visible text. Without these data it is impossible to verify the magnitude of improvement or to check whether the method reduces hallucinations without introducing new omission or over-suppression failures.

Authors: We appreciate this comment. Although the full manuscript's Section 4 provides quantitative results, ablation tables, and error analysis supporting the state-of-the-art dehallucination performance and analyzing omission and over-suppression, the abstract summarizes without specifics. We will revise the abstract to highlight key quantitative improvements and reference the detailed tables in Section 4. This will allow better verification of the improvements and failure mode analysis. revision: yes
Referee: [§3] §3 (Method): the construction of 'precise visual grounding' from visual-token semantics is described at a high level but lacks a concrete procedure or comparison showing that the resulting attention maps differ meaningfully from the model's native attention. If the grounding largely reproduces existing patterns, the intervention is redundant and the central assumption that 'the model is not already using' this information remains untested.

Authors: We agree that more concrete details would be beneficial. The manuscript's Section 3 outlines the process of extracting semantic content from visual tokens to construct the grounding and then guiding attention. To strengthen this, we will expand the method description with a precise algorithmic procedure and include additional experiments or visualizations that compare the guided attention maps to the native ones. These additions will demonstrate the meaningful differences and validate that the intervention leverages information not fully utilized by the model. revision: yes

Circularity Check

0 steps flagged

No significant circularity; VGA is an inference-time intervention with independent experimental validation.

The paper describes a training-free method that extracts semantic content from existing visual tokens to construct grounding and steer attention, with dynamic suppression in captioning. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked that would reduce the claimed dehallucination gains to a redefinition or statistical forcing of the inputs. The SOTA performance is asserted via external benchmarks and diverse MLLM experiments, which stand apart from the method's construction. The derivation chain is self-contained against those benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that visual tokens already contain extractable semantics that the model under-utilizes, plus the premise that an explicit grounding map derived from those semantics will steer attention productively.

axioms (1)

domain assumption: MLLMs accurately extract visual semantics from visual tokens yet fail to fully leverage this during subsequent inference.
Directly stated in the abstract as the observed limitation that VGA is designed to address.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We propose Vision-Guided Attention (VGA), which uses the visual grounding produced by VSC to guide visual attention... $\hat{\alpha}_{h,\,-1:,\,s:e} = \alpha_{h,\,-1:,\,s:e} + \beta \cdot G$

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

VSC... $c_{v_i}(O) = \mathrm{softmax}[\mathrm{logit}_{v_i}(O)]$... $G_O = \mathrm{Norm}[\{c_{v_i}(o_0)\}]$
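
The quoted VSC formulas can be read as: take a softmax over each visual token's vocabulary logits, pick out the probability of the object's first token o_0, and normalize those scores across visual tokens to obtain the grounding map G_O. The sketch below follows that reading with two assumptions that the quoted passage does not spell out: that the logits come from projecting each visual token through the model's output head, and that Norm is min-max scaling. The function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grounding_map(visual_logits, first_token_id):
    """Per-visual-token confidence for one object, min-max normalized.

    visual_logits:  one vocabulary-sized logit vector per visual token
    first_token_id: vocabulary index of the object's first token (o_0)
    Min-max scaling stands in for Norm and is an assumption of this sketch.
    """
    confidence = [softmax(row)[first_token_id] for row in visual_logits]
    low, high = min(confidence), max(confidence)
    if high == low:
        return [0.0] * len(confidence)
    return [(c - low) / (high - low) for c in confidence]


# Toy setup: 3 visual tokens, vocabulary of 3; the object's first token is id 1.
visual_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
]
G = grounding_map(visual_logits, first_token_id=1)
```

Only the second visual token assigns high probability to the object's token, so the map peaks there and is close to zero elsewhere.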

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Per- turbollava: Reducing multimodal hallucinations with pertur- bative visual training

Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, and Chunhua Shen. Per- turbollava: Reducing multimodal hallucinations with pertur- bative visual training. InICLR, 2025. 2

work page 2025

[3] [3]

Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. In ICML, 2025. 2, 3

work page 2025

[4] [4]

Flashattention: Fast and memory-efficient exact at- tention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R´e. Flashattention: Fast and memory-efficient exact at- tention with io-awareness. InNeurIPS, pages 16344–16359,

work page

[5] [5]

Cracking the code of hallucination in lvlms with vision-aware head divergence

Jinghan He, Kuan Zhu, Haiyun Guo, Junfeng Fang, Zhenglin Hua, Yuheng Jia, Ming Tang, Tat-Seng Chua, and Jinqiao Wang. Cracking the code of hallucination in lvlms with vision-aware head divergence. InACL, 2025. 2

work page 2025

[6] [6]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection-allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Con- ghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection-allocation. InCVPR, pages 13418–13427,

work page

[7] [7]

Interpreting and editing vision-language representations to mitigate hallucinations

Nicholas Jiang, Anish Kachinthaya, Suzanne Petryk, and Yossi Gandelsman. Interpreting and editing vision-language representations to mitigate hallucinations. InICLR, 2025. 2

work page 2025

[8] [8]

See what you are told: Visual attention sink in large multimodal models

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. InICLR, 2025. 2, 4

work page 2025

[9] [9]

Mitigating object hal- lucinations in large vision-language models through visual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hal- lucinations in large vision-language models through visual contrastive decoding. InCVPR, pages 13872–13882, 2024. 2

work page 2024

[10] [10]

Treble counterfactual vlms: A causal approach to hallucination

Li Li, Jiashu Qu, Linxin Song, Yuxiao Zhou, Yuehan Qin, Tiankai Yang, and Yue Zhao. Treble counterfactual vlms: A causal approach to hallucination. InFEMNLP, pages 18423– 18434, 2025. 2

work page 2025

[11] [11]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InEMNLP, pages 292–305,

work page

[12] [12]

The hidden life of tokens: Reducing hallucination of large vision-language models via visual in- formation steering

Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, and Dimitris N Metaxas. The hidden life of tokens: Reducing hallucination of large vision-language models via visual in- formation steering. InICML, 2025. 2

work page 2025

[13] [13]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014. 1

work page 2014

[14] [14]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR, pages 26296–26306, 2024. 5

work page 2024

[15] [15]

Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 5

work page 2024

[16] Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In ECCV, pages 125–140, 2024.

[17] Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, and Liqiang Nie. Mitigating hallucination through theory-consistent symmetric multimodal preference optimization. arXiv preprint arXiv:2506.11712, 2025.

[18] Woohyeon Park, Woojin Kim, Jaeik Kim, and Jaeyoung Do. Second: Mitigating perceptual hallucination in vision-language models via selective and contrastive decoding. In ICML, 2025.

[19] Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. Stanza: A python natural language processing toolkit for many human languages. In ACL, pages 101–108, 2020.

[20] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In EMNLP, pages 4035–4045, 2018.

[21] Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan O Arik, and Tomas Pfister. Mitigating object hallucination in mllms via data-augmented phrase-level alignment. In ICLR, 2025.

[22] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.

[23] Kaishen Wang, Hengrui Gu, Meijun Gao, and Kaixiong Zhou. Damo: Decoding by accumulating activations momentum for mitigating hallucinations in vision-language models. In ICLR, 2025.

[24] Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. In AAAI, pages 25543–25551, 2025.

[25] Chunzhao Xie, Tongxuan Liu, Lei Jiang, Yuting Zeng, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu, et al. Tarac: Mitigating hallucination in lvlms via temporal attention real-time accumulative connection. arXiv preprint arXiv:2504.04099, 2025.

[26] Tianyun Yang, Ziniu Li, Juan Cao, and Chang Xu. Understanding and mitigating hallucination in large vision-language models via modular attribution and intervention. In ICLR, 2025.

[27] Hao Yin, Guangzong Si, and Zilei Wang. Clearsight: visual signal enhancement for object hallucination mitigation in multimodal large language models. In CVPR, pages 14625–14634, 2025.

[28] Hao Yin, Guangzong Si, and Zilei Wang. Lifting the veil on visual information flow in mllms: Unlocking pathways to faster inference. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9382–9391, 2025.

[29] Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024.

[30] Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, and Yaqi Xie. Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models. In ICLR,

[31] Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. In CVPR, pages 19781–19791, 2025.

[32] Jianfei Zhao, Feng Zhang, Xin Sun, and Chong Feng. Cross-image contrastive decoding: Precise, lossless suppression of language priors in large vision-language models. arXiv preprint arXiv:2505.10634, 2025.

[33] Jianfei Zhao, Feng Zhang, Xin Sun, and Chong Feng. Aligning attention distribution to information flow for hallucination mitigation in large vision-language models, 2025.

[34] Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, and Chong Feng. Cross-layer vision smoothing: Enhancing visual understanding via sustained focus on key objects in large vision-language models. arXiv preprint arXiv:2509.12897, 2025.

[35] Linxi Zhao, Yihe Deng, Weitong Zhang, and Quanquan Gu. Mitigating object hallucination in large vision-language models via image-grounded guidance. In ICML, 2025.

[36] Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In CVPR, pages 1624–1633, 2025.

[37] Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, et al. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. In ICML, 2025.