
arxiv: 2605.20965 · v1 · pith:DGXFYNR2new · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords hallucination mitigation · large vision-language models · visual attention · inter-layer discrepancy · saliency map · attention enhancement · training-free method

The pith

LVLMs hallucinate when they forget the correct visual evidence, but inter-layer attention discrepancies reveal a way to reinforce it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large vision-language models generate responses inconsistent with images mainly because they pay too little attention to the right visual parts and then lose track of them while producing text. Although overall attention to visuals is weak, the models show clear sensitivity to the accurate evidence in particular layers, visible as big differences between layers. By examining attention weights from the first few output tokens to image tokens across all layers, the authors locate image regions that get activated repeatedly and turn those into a saliency map. This map is applied during later generation steps to keep attention on the evidence and also to favor text tokens that match it well. The whole process needs no retraining and can be added directly to existing models.

Core claim

LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. Although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers with notable inter-layer discrepancy. A saliency map is formed from attention weights of early generated tokens to visual tokens by selecting those repeatedly activated across layers; this map is then used to enhance attention to the evidence and to emphasize text tokens grounded in it, thereby reducing visual forgetting.

What carries the argument

Inter-Layer Visual Attention Discrepancy (ILVAD) identifies repeatedly activated visual tokens from early-generation attention weights across layers to build a saliency map that boosts attention and curbs forgetting.
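
A minimal sketch of how such a saliency map might be assembled, assuming the attention from the early generated tokens has already been averaged over heads and over those tokens into one row per layer. The function name `saliency_from_layers` and the parameters `top_k` and `min_layers` are hypothetical and are not taken from the paper; the paper's actual activation rule and thresholds may differ.

```python
def saliency_from_layers(attn, top_k, min_layers):
    """Build a saliency map over visual tokens.

    attn[l][v] is the (assumed pre-averaged) attention that the early
    generated tokens pay to visual token v at layer l.
    """
    num_layers = len(attn)
    num_visual = len(attn[0])
    counts = [0] * num_visual

    for layer in attn:
        # Treat the top_k most attended visual tokens as "activated" in this layer.
        ranked = sorted(range(num_visual), key=lambda v: layer[v], reverse=True)
        for v in ranked[:top_k]:
            counts[v] += 1

    # Keep only tokens activated in at least min_layers layers and
    # weight them by how often they were activated.
    return [c / num_layers if c >= min_layers else 0.0 for c in counts]


# Toy example: 3 layers, 4 visual tokens (values are made up).
toy_attn = [
    [0.10, 0.50, 0.40, 0.00],
    [0.60, 0.30, 0.10, 0.00],
    [0.00, 0.45, 0.05, 0.50],
]
toy_saliency = saliency_from_layers(toy_attn, top_k=2, min_layers=2)
```

In the toy input only visual token 1 ranks in the top two of more than one layer, so it is the only token that survives the repetition filter.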

If this is right

  • The method works without any model retraining and plugs directly into different LVLM architectures.
  • Hallucination rates drop consistently when the approach is applied to five recent models on multiple benchmarks.
  • Text tokens can be chosen and highlighted according to how strongly their attention aligns with the visual saliency map.
  • Maintaining boosted attention to the identified evidence throughout generation prevents gradual visual forgetting.
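
The last two points can be sketched as follows, under stated assumptions: attention is boosted multiplicatively on the salient visual positions and then renormalized, and a text token's grounding is scored as the share of its visual attention that falls on salient tokens. The names `enhance_attention`, `grounding_score`, `visual_positions`, and `boost` are illustrative only; the paper's actual update rule and scoring function are not given on this page.

```python
def enhance_attention(weights, visual_positions, saliency, boost):
    """Boost one query's attention on salient visual tokens, then renormalize.

    weights: attention distribution over all key positions (sums to 1).
    visual_positions: key indices of the visual tokens, aligned with saliency.
    boost: hypothetical strength of the enhancement (0 leaves weights unchanged).
    """
    out = list(weights)
    for pos, s in zip(visual_positions, saliency):
        out[pos] = weights[pos] * (1.0 + boost * s)
    total = sum(out)
    return [w / total for w in out]


def grounding_score(token_attn_to_visual, saliency):
    """Fraction of a text token's visual attention that lands on salient tokens."""
    total = sum(token_attn_to_visual)
    if total == 0:
        return 0.0
    weighted = sum(a * s for a, s in zip(token_attn_to_visual, saliency))
    return weighted / total


# Toy example: four key positions, of which positions 1 and 2 are visual tokens.
boosted = enhance_attention([0.25, 0.25, 0.25, 0.25], [1, 2], [1.0, 0.0], boost=1.0)
score = grounding_score([0.1, 0.3], [1.0, 0.0])
```

Renormalizing keeps each row a valid distribution, so raising the weight of salient visual tokens necessarily lowers the weight of everything else.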

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layer-wise discrepancy pattern could appear in other multimodal models and might guide where to insert visual grounding checks.
  • Updating the saliency map at later steps could help maintain accuracy in very long generated responses.
  • Pairing the attention reinforcement with existing alignment techniques might produce even more reliable outputs.

Load-bearing premise

Tokens that receive repeated activation across layers from the attention patterns of early generated tokens are the correct visual evidence, and strengthening attention to them will cut hallucinations without creating new inconsistencies.

What would settle it

Run the saliency-map enhancement on standard hallucination benchmarks for the tested LVLMs and compare hallucination rates with and without it. If the rates stay the same or rise instead of falling, the claim does not hold.
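
As an illustration of what such a measurement could look like, here is a sketch of a CHAIR-style object hallucination rate computed for two decoding variants. This page does not state which benchmarks or metrics the paper uses, so the metric choice, the function name `chair_scores`, and the toy data are all assumptions.

```python
def chair_scores(mentioned_objects, true_objects):
    """CHAIR-style rates from per-caption object sets.

    mentioned_objects[i]: set of objects named in caption i.
    true_objects[i]: set of objects actually present in image i.
    Returns (instance-level rate, caption-level rate).
    """
    mentioned = 0
    hallucinated = 0
    bad_captions = 0
    for said, truth in zip(mentioned_objects, true_objects):
        wrong = said - truth  # objects named but not in the image
        mentioned += len(said)
        hallucinated += len(wrong)
        bad_captions += 1 if wrong else 0
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = bad_captions / len(mentioned_objects) if mentioned_objects else 0.0
    return chair_i, chair_s


# Made-up outputs for two images, before and after an intervention.
truth = [{"dog"}, {"car"}]
baseline = chair_scores([{"dog", "cup"}, {"car"}], truth)
treated = chair_scores([{"dog"}, {"car"}], truth)
claim_survives = treated[0] < baseline[0]
```

The comparison only settles the question if both variants are run on the same images and prompts.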

Figures

Figures reproduced from arXiv: 2605.20965 by Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia, Yutong Xie, Zhenglin Hua.

Figure 1: This figure showcases our insights. The (0) blue box illustrates that LVLMs are prone to hallucination due to the neglect of ...
Figure 2: Overview of the proposed method Inter-Layer Visual Attention Discrepancy (ILVAD). We extract the visual evidence saliency ...
Figure 3: Visualization on the impact of τ towards visual evidence saliency map
Figure 4: Impact evaluation on α and β, with the black boxes highlighting the best results. The vertical axis represents α, and the horizontal axis represents β.
Figure 5: Comparison of inference times for different methods. (Plot axis: Inference Time (1x Greedy). Methods shown: Greedy, Beam, VCD, CODE, AGLA, VAF, VAR, SPARC, ONLY, VHR, Ours.)
Figure 6: Visualization of visual evidence saliency maps. The queries of images above are “Is there a cup in the image?”, “Describe ...
Figure 7: Case study for LLaVA-1.5-7B. The hallucinated text generated by the baseline (Greedy) and the corresponding real text ...
Figure 8: Case study for LLaVA-NeXT-7B. The hallucinated text generated by the baseline (Greedy) and the corresponding real text ...
Figure 9: Case study for Qwen2-VL-7B. The hallucinated text generated by the baseline (Greedy) and the corresponding real text ...
Original abstract

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that LVLMs hallucinate when paying insufficient attention to correct visual evidence and gradually forgetting it during generation. It empirically observes sensitivity to correct visual evidence in specific layers with notable inter-layer discrepancy. Motivated by this, the authors propose a training-free ILVAD method: attention weights from early generated tokens to visual tokens are used across layers to identify repeatedly activated tokens as visual evidence, forming a saliency map that boosts attention to these tokens during continued generation and re-weights text tokens strongly grounded in visual evidence. Evaluations on five recent LVLMs across benchmarks show consistent hallucination mitigation.

Significance. If the central assumption and empirical results hold, this work offers a significant practical advance by introducing a simple, training-free, plug-and-play technique for reducing hallucinations in LVLMs that leverages inter-layer attention patterns rather than model retraining. The method's reported generality across different architectures and the public code release support reproducibility and potential adoption.

major comments (2)
  1. [Method (ILVAD description)] The method section defines visual evidence as tokens repeatedly activated across layers from attention weights of the first few generated tokens, then uses the resulting saliency map to mitigate forgetting. No independent verification (ground-truth object labels, human annotations, or causal tests against random/uniform attention baselines) is reported to confirm these tokens are the correct evidence rather than spurious correlations; this assumption is load-bearing for the claim that reinforcement via ILVAD specifically counters visual forgetting and hallucinations.
  2. [Experiments and Evaluation] The experiments claim consistent mitigation across five models and various architectures, yet the manuscript provides insufficient quantitative results, ablation studies on saliency-map construction parameters (e.g., number of early tokens or layer selection), or baseline comparisons that would isolate the contribution of inter-layer discrepancy from generic increases in visual attention mass.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific benchmarks and hallucination metrics used to quantify improvements.
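
The controls requested in the major comments could be set up along these lines: build a uniform map and a shuffled map that carry the same total saliency mass as the ILVAD map, so that any remaining gain can be attributed to where the mass is placed rather than to how much visual attention is added. The helper names `uniform_control` and `shuffled_control` are hypothetical and do not come from the paper or its code release.

```python
import random


def uniform_control(saliency):
    """Spread the same total saliency mass evenly over all visual tokens."""
    n = len(saliency)
    mass = sum(saliency)
    return [mass / n] * n


def shuffled_control(saliency, seed=0):
    """Keep the same saliency values but assign them to random visual tokens."""
    shuffled = list(saliency)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Made-up saliency map over four visual tokens.
example_map = [1.0, 0.0, 0.5, 0.0]
uniform_map = uniform_control(example_map)
shuffled_map = shuffled_control(example_map, seed=0)
```

Feeding each control map through the same enhancement step and the same benchmark as the real map gives the comparison the referee asks for.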

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our work. We address each major comment below and describe the revisions we will make to the manuscript.

  1. Referee: [Method (ILVAD description)] The method section defines visual evidence as tokens repeatedly activated across layers from attention weights of the first few generated tokens, then uses the resulting saliency map to mitigate forgetting. No independent verification (ground-truth object labels, human annotations, or causal tests against random/uniform attention baselines) is reported to confirm these tokens are the correct evidence rather than spurious correlations; this assumption is load-bearing for the claim that reinforcement via ILVAD specifically counters visual forgetting and hallucinations.

    Authors: We acknowledge that the manuscript relies on empirical observations of inter-layer attention discrepancies without providing independent verification such as ground-truth object labels or direct comparisons to random baselines. Our definition of visual evidence stems from the consistent activation patterns observed in early tokens across layers, which we link to reduced hallucinations when reinforced. To strengthen this, we will add experiments in the revised manuscript that include comparisons against random and uniform attention baselines, as well as any available causal analyses, to better demonstrate that the selected tokens are not spurious.
    revision: yes

  2. Referee: [Experiments and Evaluation] The experiments claim consistent mitigation across five models and various architectures, yet the manuscript provides insufficient quantitative results, ablation studies on saliency-map construction parameters (e.g., number of early tokens or layer selection), or baseline comparisons that would isolate the contribution of inter-layer discrepancy from generic increases in visual attention mass.

    Authors: The current manuscript reports consistent improvements across five LVLMs and multiple benchmarks, but we agree that the experimental section would benefit from more detailed quantitative breakdowns and ablations. We will expand the revised version to include ablation studies on the number of early tokens used, layer selection choices, and additional baselines that apply generic visual attention boosts without leveraging inter-layer discrepancy. These additions will help isolate the specific contribution of our approach.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an explicit empirical procedure on observed attention patterns


The paper's central contribution is an empirical observation that LVLMs exhibit inter-layer attention discrepancies to visual tokens, followed by a training-free procedural definition of ILVAD: extract attention weights from the first few generated tokens to image tokens across layers, select repeatedly activated tokens to form a saliency map, then use the map to boost visual attention and re-weight text tokens during continued generation. This procedure is defined directly in terms of the model's internal attention weights rather than any fitted parameter, self-referential equation, or prior result that reduces to the target claim by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the derivation; the assumption that repeatedly activated tokens constitute 'correct visual evidence' is presented as a motivated heuristic whose effectiveness is then tested on external benchmarks, keeping the chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that inter-layer attention patterns identify correct visual evidence and that reinforcing those patterns reduces forgetting. This depends on the domain assumption that attention weights serve as a reliable proxy for factual grounding.

axioms (1)
  • domain assumption: Attention weights from early generated tokens to visual tokens reflect the model's sensitivity to correct visual evidence.
    Invoked to justify constructing the saliency map from inter-layer discrepancies.

pith-pipeline@v0.9.0 · 5821 in / 1321 out tokens · 41303 ms · 2026-05-21T04:59:34.457103+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 6 internal anchors

  1. [1]

    2025 , pages =

    Huang, Xijie and Wang, Xinyuan and Zhang, Hantao and Zhu, Yinghao and Xi, Jiawen and An, Jingkun and Wang, Hao and Liang, Hao and Pan, Chengwei , title =. 2025 , pages =

  2. [2]

    Advances in Neural Information Processing Systems , volume =

    Haotian Liu and Chunyuan Li and Qingyang Wu and Yong Jae Lee , title =. Advances in Neural Information Processing Systems , volume =

  3. [3]

    The Twelfth International Conference on Learning Representations , year =

    Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny , title =. The Twelfth International Conference on Learning Representations , year =

  4. [7]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

    Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae , year=. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge , url=

  5. [8]

    The Thirteenth International Conference on Learning Representations , year =

    Seil Kang and Jinyeong Kim and Junhyeok Kim and Seong Jae Hwang , title =. The Thirteenth International Conference on Learning Representations , year =

  6. [9]

    The Twelfth International Conference on Learning Representations , year =

    Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs , author=. The Twelfth International Conference on Learning Representations , year =

  7. [10]

    Proceedings of the

    Sicong Leng and Hang Zhang and Guanzheng Chen and Xin Li and Shijian Lu and Chunyan Miao and Lidong Bing , title =. Proceedings of the

  8. [11]

    Advances in Neural Information Processing Systems , volume =

    Junho Kim and Hyunjun Kim and Yeonju Kim and Yong Man Ro , title =. Advances in Neural Information Processing Systems , volume =

  9. [12]

    Proceedings of the

    Wenbin An and Feng Tian and Sicong Leng and Jiahao Nie and Haonan Lin and Qianying Wang and Ping Chen and Xiaoqin Zhang and Shijian Lu , title =. Proceedings of the

  10. [13]

    and Stepputtis, Simon and Morency, Louis-Philippe and Ramanan, Deva and Sycara, Katia and Xie, Yaqi , booktitle=

    Wan, Zifu and Zhang, Ce and Yong, Silong and Ma, Martin Q. and Stepputtis, Simon and Morency, Louis-Philippe and Ramanan, Deva and Sycara, Katia and Xie, Yaqi , booktitle=. ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models , pages=

  11. [14]

    Proceedings of the

    Hao Yin and Guangzong Si and Zilei Wang , title =. Proceedings of the

  12. [15]

    Proceedings of the 42nd International Conference on Machine Learning , pages =

    Mingi Jung and Saehyung Lee and Eunji Kim and Sungroh Yoon , title =. Proceedings of the 42nd International Conference on Machine Learning , pages =

  13. [16]

    Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence , booktitle =

    Jinghan He and Kuan Zhu and Haiyun Guo and Junfeng Fang and Zhenglin Hua and Yuheng Jia and Ming Tang and Tat. Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence , booktitle =

  14. [18]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and Chua, Tat-Seng , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  15. [19]

    Steering LVLM s via Sparse Autoencoder for Hallucination Mitigation

    Hua, Zhenglin and He, Jinghan and Yao, Zijun and Han, Tianxu and Guo, Haiyun and Jia, Yuheng and Fang, Junfeng. Steering LVLM s via Sparse Autoencoder for Hallucination Mitigation. Findings of the Association for Computational Linguistics: EMNLP. 2025

  16. [20]

    The Thirteenth International Conference on Learning Representations , year=

    AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

  17. [21]

    Science China Information Sciences , volume =

    Woodpecker: Hallucination correction for multimodal large language models , author=. Science China Information Sciences , volume =

  18. [22]

    Shallow Focus, Deep Fixes: Enhancing Shallow Layers Vision Attention Sinks to Alleviate Hallucination in LVLM s

    Zhang, Xiaofeng and Quan, Yihao and Shen, Chen and Gu, Chaochen and Yuan, Xiaosong and Yan, Shaotian and Cao, Jiawei and Cheng, Hao and Wu, Kaijie and Ye, Jieping. Shallow Focus, Deep Fixes: Enhancing Shallow Layers Vision Attention Sinks to Alleviate Hallucination in LVLM s. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2025

  19. [24]

    Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models , year =

    Huo, Fushuo and Xu, Wenchao and Zhang, Zhong and Wang, Haozhao and Chen, Zhicheng and Zhao, Peilin , booktitle=. Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models , year =

  20. [25]

    Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding , pages=

    Tang, Feilong and Liu, Chengzhi and Xu, Zhongxing and Hu, Ming and Huang, Zile and Xue, Haochen and Chen, Ziyang and Peng, Zelin and Yang, Zhiwei and Zhou, Sijin and Li, Wenxue and Li, Yulong and Song, Wenxuan and Su, Shiyan and Feng, Wei and Su, Jionglong and Lin, Minquan and Peng, Yifan and Cheng, Xuelian and Razzak, Imran and Ge, Zongyuan , booktitle=....

  21. [27]

    VQAG uider: Guiding Multimodal Large Language Models to Answer Complex Video Questions

    Chen, Yuyan and Jia, Jiyuan and Lu, Jiaxin and Li, Siyue and Guan, Yu and Yang, Ming and Guo, Qingpei. VQAG uider: Guiding Multimodal Large Language Models to Answer Complex Video Questions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 2025

  22. [28]

    Grounding Multimodal Large Language Models to the World , year =

    Peng, Zhiliang and Wang, Wenhui and Dong, Li and Hao, Yaru and Huang, Shaohan and Ma, Shuming and Ye, Qixiang and Wei, Furu , booktitle=. Grounding Multimodal Large Language Models to the World , year =

  23. [29]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Zhang, Ruifei and Zhang, Wei and Tan, Xiao and Yang, Sibei and Wan, Xiang and Luo, Xiaonan and Li, Guanbin , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  24. [30]

    Object Hallucination in Image Captioning

    Rohrbach, Anna and Hendricks, Lisa Anne and Burns, Kaylee and Darrell, Trevor and Saenko, Kate. Object Hallucination in Image Captioning. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2018

  25. [31]

    Evaluating Object Hallucination in Large Vision-Language Models

    Li, Yifan and Du, Yifan and Zhou, Kun and Wang, Jinpeng and Zhao, Xin and Wen, Ji-Rong. Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2023

  26. [33]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Sun, Zhiqing and Shen, Sheng and Cao, Shengcao and Liu, Haotian and Li, Chunyuan and Shen, Yikang and Gan, Chuang and Gui, Liangyan and Wang, Yu-Xiong and Yang, Yiming and Keutzer, Kurt and Darrell, Trevor. Aligning Large Multimodal Models with Factually Augmented RLHF. Findings of the Association for Computational Linguistics: ACL. 2024

  27. [34]

    Improved Baselines with Visual Instruction Tuning , pages=

    Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae , booktitle=. Improved Baselines with Visual Instruction Tuning , pages=

  28. [35]

    Lawrence

    Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll \'a r, Piotr and Zitnick, C. Lawrence. Microsoft COCO: Common Objects in Context. 2014

  29. [36]

    2022 , booktitle =

    Schwenk, Dustin and Khandelwal, Apoorv and Clark, Christopher and Marino, Kenneth and Mottaghi, Roozbeh , title =. 2022 , booktitle =

  30. [37]

    and Manning, Christopher D

    Hudson, Drew A. and Manning, Christopher D. , booktitle=. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , year=

  31. [38]

    Proceedings of the Thirty-Ninth

    Qi Sun and Marc Pickett and Aakash Kumar Nain and Llion Jones , title =. Proceedings of the Thirty-Ninth

  32. [39]

    DAMRO : Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

    Gong, Xuan and Ming, Tianshi and Wang, Xinpeng and Wei, Zhihua. DAMRO : Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2024

  33. [40]

    Proceedings of the 42nd International Conference on Machine Learning , volume =

    Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage , author =. Proceedings of the 42nd International Conference on Machine Learning , volume =

  34. [41]

    I mage I n W ords: Unlocking Hyper-Detailed Image Descriptions

    Garg, Roopal and Burns, Andrea and Karagol Ayan, Burcu and Bitton, Yonatan and Montgomery, Ceslee and Onoe, Yasumasa and Bunner, Andrew and Krishna, Ranjay and Baldridge, Jason Michael and Soricut, Radu. I mage I n W ords: Unlocking Hyper-Detailed Image Descriptions. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2024

  35. [42]

    and Petryk, Suzanne and Gonzalez, Joseph E

    Chan, David M. and Petryk, Suzanne and Gonzalez, Joseph E. and Darrell, Trevor and Canny, John. CLAIR : Evaluating Image Captions with Large Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2023

  36. [43]

    and Kachinthaya, Anish and Zou, Haodi and Canny, John and Gonzalez, Joseph E

    Petryk, Suzanne and Chan, David M. and Kachinthaya, Anish and Zou, Haodi and Canny, John and Gonzalez, Joseph E. and Darrell, Trevor. ALOH a: A New Measure for Hallucination in Captioning Models. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2024

  37. [44]

    Proceedings of the Fortieth

    Xinyue Wang and Yuheng Jia and Hui Liu and Junhui Hou , title =. Proceedings of the Fortieth

  38. [45]

    Mitigating object hallucinations in large vision-language models with assembly of global and local attention

    An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., and Lu, S. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp.\ 29915--29926, 2025

  39. [46]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  40. [47]

    M., Petryk, S., Gonzalez, J

    Chan, D. M., Petryk, S., Gonzalez, J. E., Darrell, T., and Canny, J. CLAIR : Evaluating image captions with large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 13638--13646, 2023

  41. [48]

    VQAG uider: Guiding multimodal large language models to answer complex video questions

    Chen, Y., Jia, J., Lu, J., Li, S., Guan, Y., Yang, M., and Guo, Q. VQAG uider: Guiding multimodal large language models to answer complex video questions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp.\ 7821--7834, 2025

  42. [49]

    Alphaedit: Null-space constrained knowledge editing for language models

    Fang, J., Jiang, H., Wang, K., Ma, Y., Shi, J., Wang, X., He, X., and Chua, T.-S. Alphaedit: Null-space constrained knowledge editing for language models. In The Thirteenth International Conference on Learning Representations, 2025

  43. [50]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Qiu, Z., Lin, W., Yang, J., Zheng, X., Li, K., Sun, X., and Ji, R. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  44. [51]

    M., and Soricut, R

    Garg, R., Burns, A., Karagol Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J. M., and Soricut, R. I mage I n W ords: Unlocking hyper-detailed image descriptions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 93--127, 2024

  45. [52]

    DAMRO : Dive into the attention mechanism of LVLM to reduce object hallucination

    Gong, X., Ming, T., Wang, X., and Wei, Z. DAMRO : Dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 7696--7712, 2024

  46. [53]

    Cracking the code of hallucination in lvlms with vision-aware head divergence

    He, J., Zhu, K., Guo, H., Fang, J., Hua, Z., Jia, Y., Tang, M., Chua, T., and Wang, J. Cracking the code of hallucination in lvlms with vision-aware head divergence. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp.\ 3488--3501, 2025

  47. [54]

    Steering LVLM s via sparse autoencoder for hallucination mitigation

    Hua, Z., He, J., Yao, Z., Han, T., Guo, H., Jia, Y., and Fang, J. Steering LVLM s via sparse autoencoder for hallucination mitigation. In Findings of the Association for Computational Linguistics: EMNLP, pp.\ 10808--10828, 2025

  48. [55]

    Medical mllm is vulnerable: cross-modality jailbreak and mismatched attacks on medical multimodal large language models

    Huang, X., Wang, X., Zhang, H., Zhu, Y., Xi, J., An, J., Wang, H., Liang, H., and Pan, C. Medical mllm is vulnerable: cross-modality jailbreak and mismatched attacks on medical multimodal large language models. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence, pp.\ 3797--3805, 2025

  49. [56]

    Self-introspective decoding: Alleviating hallucinations for large vision-language models

    Huo, F., Xu, W., Zhang, Z., Wang, H., Chen, Z., and Zhao, P. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In The Thirteenth International Conference on Learning Representations, 2025

  50. [57]

    Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models

    Jung, M., Lee, S., Kim, E., and Yoon, S. Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp.\ 28527--28551, 2025

  51. [58]

    Kang, S., Kim, J., Kim, J., and Hwang, S. J. See what you are told: Visual attention sink in large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025

  52. [59]

    Kim, J., Kim, H., Kim, Y., and Ro, Y. M. CODE: contrasting self-generated description to combat hallucination in large multi-modal models. In Advances in Neural Information Processing Systems, volume 37, pp.\ 133571--133599, 2024

  53. [60]

    Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage

    Lee, S., Yoon, S., Bui, T., Shi, J., and Yoon, S. Toward robust hyper-detailed image captioning: A multiagent approach and dual evaluation metrics for factuality and coverage. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp.\ 33815--33832, 2025

  54. [61]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp.\ 13872--13882, 2024

  55. [62]

    Evaluating object hallucination in large vision-language models

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.\ 292--305, 2023

  56. [63]

    Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll \'a r, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pp.\ 740--755, 2014

  57. [64]

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pp. 34892–34916, 2023

  58. [65]

    Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26286–26296, 2024a

  59. [66]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  60. [67]

    Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., and Peng, W. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024c

  61. [68]

    Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Ye, Q., and Wei, F. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024

  62. [69]

    Petryk, S., Chan, D. M., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J. E., and Darrell, T. ALOHa: A new measure for hallucination in captioning models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 342–357, 2024

  63. [70]

    Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. Object hallucination in image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045, 2018

  64. [71]

    Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.-X., Yang, Y., Keutzer, K., and Darrell, T. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL, pp. 13088–13110, 2024

  65. [72]

    Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., Li, W., Li, Y., Song, W., Su, S., Feng, W., Su, J., Lin, M., Peng, Y., Cheng, X., Razzak, I., and Ge, Z. Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and P...

  66. [73]

    Wan, Z., Zhang, C., Yong, S., Ma, M. Q., Stepputtis, S., Morency, L.-P., Ramanan, D., Sycara, K., and Xie, Y. Only: One-layer intervention sufficiently mitigates hallucinations in large vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3225–3234, 2025

  67. [74]

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  68. [75]

    Wang, T., Zhang, J., Fei, J., Ge, Y., Zheng, H., Tang, Y., Li, Z., Gao, M., Zhao, S., Shan, Y., and Zheng, F. Caption anything: Interactive image description with diverse multimodal controls. arXiv preprint arXiv:2305.02677, 2023

  69. [76]

    Wang, X., Jia, Y., Liu, H., and Hou, J. ESMC: mllm-based embedding selection for explainable multiple clustering. In Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, pp. 26588–26596, 2026

  70. [77]

    Yin, H., Si, G., and Wang, Z. Clearsight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14625–14634, 2025

  71. [78]

    Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., and Chen, E. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67, 2024

  72. [79]

    Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.-T., Sun, M., and Chua, T.-S. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13807–13816, 2024

  73. [80]

    Zhang, Q., Singh, C., Liu, L., Liu, X., Yu, B., Gao, J., and Zhao, T. Tell your model where to attend: Post-hoc attention steering for llms. In The Twelfth International Conference on Learning Representations, 2024

  74. [81]

    Zhang, R., Zhang, W., Tan, X., Yang, S., Wan, X., Luo, X., and Li, G. Vldrive: Vision-augmented lightweight mllms for efficient language-grounded autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5923–5933, 2025a

  75. [82]

    Zhang, X., Quan, Y., Shen, C., Gu, C., Yuan, X., Yan, S., Cao, J., Cheng, H., Wu, K., and Ye, J. Shallow focus, deep fixes: Enhancing shallow layers vision attention sinks to alleviate hallucination in LVLMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3512–3534, 2025b

  76. [83]

    Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., and He, C. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023

  77. [84]

    Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024

  78. [85]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., Gao, Z., Cui, E., Wang, X., Cao, Y., Liu, Y., Wei, X., Zhang, H., Wang, H., Xu, W., Li, H., Wang, J., Deng, N., Li, S., He, Y., Jiang, T., Luo, J., Wang, Y., He, C., Shi, B., Zhang, X., Shao, W., He, J., Xiong, Y., Qu, W., Sun, P., Jiao, P., Lv, H., Wu, L., Zhang, ...