Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
Pith reviewed 2026-05-17 04:55 UTC · model grok-4.3
The pith
Vision-Guided Attention reduces hallucinations in MLLMs by building precise visual grounding from token semantics and steering focus to relevant image regions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explicit visual guidance derived from the semantic content already present in visual tokens can steer MLLMs toward accurate image regions and thereby reduce hallucinations. For image captioning the guidance is refined dynamically by suppressing regions that have already been described. The entire process requires only a single forward pass per token and remains fully compatible with efficient attention implementations.
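The dynamic refinement for captioning can be pictured with a small example. This is a minimal sketch under stated assumptions, not the paper's procedure: the summary here does not say how suppression is computed, so the thresholding rule and the names `suppress_described`, `threshold`, and `decay` are hypothetical.

```python
def suppress_described(guidance, described_maps, threshold=0.5, decay=0.0):
    """Down-weight guidance on regions covered by already-described objects.

    guidance:       per-visual-token guidance weights (list of floats).
    described_maps: one grounding map per already-described object, each the
                    same length as `guidance`, with values in [0, 1].
    threshold, decay: hypothetical knobs, not taken from the paper.
    """
    out = list(guidance)  # copy so the caller's list is left untouched
    for gmap in described_maps:
        if len(gmap) != len(out):
            raise ValueError("grounding map length must match guidance length")
        for i, g in enumerate(gmap):
            # A token strongly grounded to a described object is suppressed.
            if g >= threshold:
                out[i] *= decay
    return out


guidance = [0.9, 0.8, 0.1, 0.7]
dog_map = [1.0, 0.9, 0.0, 0.1]  # toy map for an object already mentioned
refined = suppress_described(guidance, [dog_map])
print(refined)  # [0.0, 0.0, 0.1, 0.7]
```

With `decay=0.0` the described region is removed outright; a value between 0 and 1 would only soften it, which may matter for the over-suppression concern raised in the referee report.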
What carries the argument
Vision-Guided Attention (VGA) constructs precise visual grounding from the semantics in visual tokens and then applies that grounding to direct the model's attention during inference.
If this is right
- VGA delivers state-of-the-art results on multiple hallucination benchmarks across diverse MLLMs.
- The method adds negligible latency because every token requires only a single forward pass.
- No training or parameter updates are needed on the underlying MLLM.
- Dynamic suppression of already-described regions improves captioning accuracy.
- Full compatibility with FlashAttention preserves efficient inference.
Where Pith is reading between the lines
- The same grounding construction could be tested on video or multi-image inputs where localization errors compound over time.
- Current MLLM attention layers may be systematically under-using information that their own visual encoders already provide.
- Combining VGA with existing training-based dehallucination methods might produce additive improvements.
- Analogous guidance signals might help in non-visual modalities, provided token semantics there can be turned into grounding in the same way.
Load-bearing premise
The semantic information already inside visual tokens can be turned into grounding signals the model is not already using, and directing attention with those signals will reduce hallucinations without creating new failure modes.
What would settle it
An experiment that supplies perfect visual grounding to the model yet still records the original rate of hallucinations on standard benchmarks would show the guidance step is not the decisive factor.
Original abstract
Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Vision-Guided Attention (VGA), a training-free inference-time intervention for MLLMs. It first extracts semantic content from visual tokens to build explicit visual grounding, then steers attention toward relevant image regions; in captioning it further suppresses already-described regions dynamically. Each token requires only one forward pass, the method is compatible with FlashAttention, and the authors claim SOTA dehallucination results across multiple MLLMs and benchmarks together with improved visual understanding.
Significance. If the central empirical claims hold, VGA would be a useful practical contribution: a lightweight, training-free technique that exploits already-computed visual-token representations to reduce hallucinations. The training-free nature, negligible latency, and FlashAttention compatibility are clear strengths that lower the barrier to adoption. The dynamic suppression mechanism in captioning is a reasonable extension of the core idea.
major comments (2)
- [§4 and abstract] §4 (Experimental Results) and the abstract: the claim of 'state-of-the-art dehallucination performance' is presented without any quantitative numbers, ablation tables, or error analysis in the visible text. Without these data it is impossible to verify the magnitude of improvement or to check whether the method reduces hallucinations without introducing new omission or over-suppression failures.
- [§3] §3 (Method): the construction of 'precise visual grounding' from visual-token semantics is described at a high level but lacks a concrete procedure or comparison showing that the resulting attention maps differ meaningfully from the model's native attention. If the grounding largely reproduces existing patterns, the intervention is redundant and the central assumption that 'the model is not already using' this information remains untested.
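One way to run the comparison requested in the second major comment is to measure how far a grounding map sits from the model's native attention over the same visual tokens. The sketch below uses Jensen-Shannon divergence, which is a swapped-in choice and not something the paper specifies. The function names and the toy maps are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl(p, q):
    """KL divergence in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(a, b):
    """Jensen-Shannon divergence between two maps over the same tokens.

    Returns 0 for identical maps and at most ln(2) for disjoint ones.
    """
    if len(a) != len(b):
        raise ValueError("maps must cover the same visual tokens")
    p, q = normalize(a), normalize(b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy maps over four visual tokens (illustrative values only).
native_attention = [0.40, 0.30, 0.20, 0.10]
grounding = [0.05, 0.05, 0.10, 0.80]
score = js_divergence(native_attention, grounding)
```

A score near zero across many images would suggest the grounding largely reproduces native attention, which is the redundancy the referee is worried about.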
minor comments (2)
- [Abstract] Abstract: including one or two key quantitative results (e.g., percentage reduction on a standard benchmark) would make the SOTA claim immediately verifiable.
- [§3] Notation: the distinction between 'visual grounding' and 'attention guidance' should be clarified with a short equation or pseudocode snippet to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for their valuable feedback on our manuscript. We provide point-by-point responses to the major comments and outline the revisions we plan to make.
- Referee: [§4 and abstract] §4 (Experimental Results) and the abstract: the claim of 'state-of-the-art dehallucination performance' is presented without any quantitative numbers, ablation tables, or error analysis in the visible text. Without these data it is impossible to verify the magnitude of improvement or to check whether the method reduces hallucinations without introducing new omission or over-suppression failures.
  Authors: We appreciate this comment. Although the full manuscript's Section 4 provides quantitative results, ablation tables, and error analysis supporting the state-of-the-art dehallucination performance and analyzing omission and over-suppression, the abstract summarizes without specifics. We will revise the abstract to highlight key quantitative improvements and reference the detailed tables in Section 4. This will allow better verification of the improvements and failure mode analysis.
  Revision: yes
- Referee: [§3] §3 (Method): the construction of 'precise visual grounding' from visual-token semantics is described at a high level but lacks a concrete procedure or comparison showing that the resulting attention maps differ meaningfully from the model's native attention. If the grounding largely reproduces existing patterns, the intervention is redundant and the central assumption that 'the model is not already using' this information remains untested.
  Authors: We agree that more concrete details would be beneficial. The manuscript's Section 3 outlines the process of extracting semantic content from visual tokens to construct the grounding and then guiding attention. To strengthen this, we will expand the method description with a precise algorithmic procedure and include additional experiments or visualizations that compare the guided attention maps to the native ones. These additions will demonstrate the meaningful differences and validate that the intervention leverages information not fully utilized by the model.
  Revision: yes
Circularity Check
No significant circularity; VGA is an inference-time intervention with independent experimental validation.
The paper describes a training-free method that extracts semantic content from existing visual tokens to construct grounding and steer attention, with dynamic suppression in captioning. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked that would reduce the claimed dehallucination gains to a redefinition or statistical forcing of the inputs. The SOTA performance is asserted via external benchmarks and diverse MLLM experiments, which stand apart from the method's construction. The derivation chain is self-contained against those benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MLLMs accurately extract visual semantics from visual tokens yet fail to fully leverage this during subsequent inference
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We propose Vision-Guided Attention (VGA), which uses the visual grounding produced by VSC to guide visual attention... $\hat{\alpha}_{h,\,-1:,\,s:e} = \alpha_{h,\,-1:,\,s:e} + \beta \cdot G$
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: VSC... $c_{v_i}(O) = \mathrm{softmax}[\mathrm{logit}_{v_i}(O)]$... $G_O = \mathrm{Norm}[\{c_{v_i}(o_0)\}]$
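The two quoted passages can be read together as a two-step pipeline: score each visual token for an object, normalise the scores into a grounding map G, then add a scaled G to the attention over the visual-token span. The following is a minimal sketch of that reading, not the authors' implementation. Several points are assumptions because the excerpts elide them: each visual token is taken to carry a vector of vocabulary logits, o_0 is read as the first token id of the object, Norm is taken to be min-max scaling, and no renormalisation follows the addition since the quoted update shows none. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grounding_map(token_logits, object_token_id):
    """Per-visual-token confidence for one object token, min-max scaled.

    token_logits: one vocabulary-logit vector per visual token (assumed).
    Min-max scaling stands in for the unspecified Norm operator.
    """
    conf = [softmax(row)[object_token_id] for row in token_logits]
    lo, hi = min(conf), max(conf)
    if hi == lo:
        return [0.0] * len(conf)  # no token stands out, so no guidance
    return [(c - lo) / (hi - lo) for c in conf]


def guide_attention(alpha, grounding, start, end, beta):
    """Apply alpha_hat[start:end] = alpha[start:end] + beta * G.

    alpha is one attention row (last query position, one head); positions
    outside the visual span [start, end) are left unchanged.
    """
    if end - start != len(grounding):
        raise ValueError("grounding must cover exactly the visual span")
    guided = list(alpha)
    for offset, g in enumerate(grounding):
        guided[start + offset] += beta * g
    return guided


# Toy example: three visual tokens, vocabulary of three, object token id 1.
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
G = grounding_map(token_logits, object_token_id=1)
alpha = [0.1, 0.2, 0.3, 0.2, 0.2]  # text token, three visual tokens, text token
guided = guide_attention(alpha, G, start=1, end=4, beta=0.5)
```

Whether the update is applied to pre-softmax scores or to post-softmax weights is not stated in the quoted passage, so the sketch treats `alpha` as an opaque row of numbers.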
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.