Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3
The pith
LVLMs hallucinate when they forget the correct visual evidence, but inter-layer attention discrepancies reveal a way to reinforce it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. Although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers with notable inter-layer discrepancy. A saliency map is formed from attention weights of early generated tokens to visual tokens by selecting those repeatedly activated across layers; this map is then used to enhance attention to the evidence and to emphasize text tokens grounded in it, thereby reducing visual forgetting.
What carries the argument
Inter-Layer Visual Attention Discrepancy (ILVAD) identifies repeatedly activated visual tokens from early-generation attention weights across layers to build a saliency map that boosts attention and curbs forgetting.
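One way to picture the "repeatedly activated across layers" selection is a per-layer vote. The sketch below is a minimal illustration, not the authors' implementation: the function name `build_saliency_map`, the top-k activation rule, and the `min_layers` threshold are all assumptions introduced here, since the summary does not state how activation or repetition is defined.

```python
import numpy as np

def build_saliency_map(attn, top_k=8, min_layers=3):
    """Binary saliency over visual tokens (hypothetical selection rule).

    attn: array of shape (layers, early_tokens, visual_tokens) holding
          attention weights from early generated tokens to visual tokens.
    top_k, min_layers: illustrative hyperparameters, not from the paper.
    """
    attn = np.asarray(attn, dtype=float)
    # Average over the early generated tokens -> (layers, visual_tokens).
    per_layer = attn.mean(axis=1)
    # Mark the top_k most-attended visual tokens in each layer as "activated".
    top_idx = np.argsort(per_layer, axis=1)[:, -top_k:]
    active = np.zeros_like(per_layer, dtype=bool)
    np.put_along_axis(active, top_idx, True, axis=1)
    # Count the number of layers in which each visual token was activated.
    votes = active.sum(axis=0)
    # Keep only tokens activated in at least min_layers layers.
    return (votes >= min_layers).astype(float)
```

A token that spikes in a single layer gets one vote and is dropped, while a token that ranks highly in several layers survives. That is the intuition behind treating repetition as evidence.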
If this is right
- The method works without any model retraining and plugs directly into different LVLM architectures.
- Hallucination rates drop consistently when the approach is applied to five recent models on multiple benchmarks.
- Text tokens can be chosen and highlighted according to how strongly their attention aligns with the visual saliency map.
- Maintaining boosted attention to the identified evidence throughout generation prevents gradual visual forgetting.
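The idea of keeping attention boosted on the identified evidence can be sketched as a re-weighting of one attention row followed by renormalization. This is only an assumed form: the multiplicative rule, the strength parameter `alpha`, and the name `boost_visual_attention` are hypothetical, and the paper may intervene at a different point (for example on pre-softmax logits).

```python
import numpy as np

def boost_visual_attention(attn_row, visual_slice, saliency, alpha=0.5):
    """Re-weight one post-softmax attention row toward salient visual tokens.

    attn_row:     1-D attention weights over all key positions (sums to 1).
    visual_slice: slice selecting the visual-token positions in attn_row.
    saliency:     1-D map over visual tokens, e.g. a 0/1 evidence mask.
    alpha:        illustrative boost strength (an assumption).
    """
    out = np.asarray(attn_row, dtype=float).copy()
    # Scale salient visual positions by (1 + alpha); others are unchanged.
    out[visual_slice] *= 1.0 + alpha * np.asarray(saliency, dtype=float)
    # Renormalize so the row is again a probability distribution.
    return out / out.sum()
```

Applying the same map at every decoding step is what would keep the evidence from fading as the response grows longer.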
Where Pith is reading between the lines
- The same layer-wise discrepancy pattern could appear in other multimodal models and might guide where to insert visual grounding checks.
- Updating the saliency map at later steps could help maintain accuracy in very long generated responses.
- Pairing the attention reinforcement with existing alignment techniques might produce even more reliable outputs.
Load-bearing premise
Tokens that receive repeated activation across layers from the attention patterns of early generated tokens are the correct visual evidence, and strengthening attention to them will cut hallucinations without creating new inconsistencies.
What would settle it
Run the saliency-map enhancement on standard hallucination benchmarks for the tested LVLMs and compare hallucination rates with and without it. The claim is undermined if the rates stay the same or rise instead of falling.
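The check described here reduces to comparing two rates on the same set of responses. The helper names and the per-response boolean flags below are assumptions for illustration; a real benchmark would supply its own definition of a hallucinated response.

```python
def hallucination_rate(flags):
    """Fraction of responses flagged as hallucinated (flags: iterable of bool)."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one response")
    return sum(bool(f) for f in flags) / len(flags)

def rate_change(baseline_flags, treated_flags):
    """Treated rate minus baseline rate; a negative value means fewer hallucinations."""
    return hallucination_rate(treated_flags) - hallucination_rate(baseline_flags)
```

A change at or above zero on the tested models would count against the claim. A negative change alone does not show that inter-layer selection, rather than extra visual attention in general, is responsible.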
Original abstract
Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
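The last step of the abstract, scoring generated text tokens against the visual evidence, can be read as a weighted sum of each token's attention over the saliency map. The sketch below assumes that reading. The function names and the top-m selection rule are hypothetical, and how the selected tokens are then emphasized is not specified in this summary.

```python
import numpy as np

def grounding_scores(text_to_visual_attn, saliency):
    """Attention mass each text token places on salient visual tokens.

    text_to_visual_attn: array of shape (text_tokens, visual_tokens).
    saliency:            1-D map over visual tokens.
    """
    a = np.asarray(text_to_visual_attn, dtype=float)
    s = np.asarray(saliency, dtype=float)
    return a @ s

def select_grounded_tokens(scores, top_m=2):
    """Indices of the top_m highest-scoring text tokens, in position order."""
    order = np.argsort(scores)[::-1][:top_m]
    return sorted(order.tolist())
```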
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LVLMs hallucinate when paying insufficient attention to correct visual evidence and gradually forgetting it during generation. It empirically observes sensitivity to correct visual evidence in specific layers with notable inter-layer discrepancy. Motivated by this, the authors propose a training-free ILVAD method: attention weights from early generated tokens to visual tokens are used across layers to identify repeatedly activated tokens as visual evidence, forming a saliency map that boosts attention to these tokens during continued generation and re-weights text tokens strongly grounded in visual evidence. Evaluations on five recent LVLMs across benchmarks show consistent hallucination mitigation.
Significance. If the central assumption and empirical results hold, this work offers a significant practical advance by introducing a simple, training-free, plug-and-play technique for reducing hallucinations in LVLMs that leverages inter-layer attention patterns rather than model retraining. The method's reported generality across different architectures and the public code release support reproducibility and potential adoption.
major comments (2)
- [Method (ILVAD description)] The method section defines visual evidence as tokens repeatedly activated across layers from attention weights of the first few generated tokens, then uses the resulting saliency map to mitigate forgetting. No independent verification (ground-truth object labels, human annotations, or causal tests against random/uniform attention baselines) is reported to confirm these tokens are the correct evidence rather than spurious correlations; this assumption is load-bearing for the claim that reinforcement via ILVAD specifically counters visual forgetting and hallucinations.
- [Experiments and Evaluation] The experiments claim consistent mitigation across five models and various architectures, yet the manuscript provides insufficient quantitative results, ablation studies on saliency-map construction parameters (e.g., number of early tokens or layer selection), or baseline comparisons that would isolate the contribution of inter-layer discrepancy from generic increases in visual attention mass.
minor comments (1)
- [Abstract] The abstract would benefit from naming the specific benchmarks and hallucination metrics used to quantify improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify our work. We address each major comment below and describe the revisions we will make to the manuscript.
- Referee: [Method (ILVAD description)] The method section defines visual evidence as tokens repeatedly activated across layers from attention weights of the first few generated tokens, then uses the resulting saliency map to mitigate forgetting. No independent verification (ground-truth object labels, human annotations, or causal tests against random/uniform attention baselines) is reported to confirm these tokens are the correct evidence rather than spurious correlations; this assumption is load-bearing for the claim that reinforcement via ILVAD specifically counters visual forgetting and hallucinations.
  Authors: We acknowledge that the manuscript relies on empirical observations of inter-layer attention discrepancies without providing independent verification such as ground-truth object labels or direct comparisons to random baselines. Our definition of visual evidence stems from the consistent activation patterns observed in early tokens across layers, which we link to reduced hallucinations when reinforced. To strengthen this, we will add experiments in the revised manuscript that include comparisons against random and uniform attention baselines, as well as any available causal analyses, to better demonstrate that the selected tokens are not spurious.
  Revision: yes
- Referee: [Experiments and Evaluation] The experiments claim consistent mitigation across five models and various architectures, yet the manuscript provides insufficient quantitative results, ablation studies on saliency-map construction parameters (e.g., number of early tokens or layer selection), or baseline comparisons that would isolate the contribution of inter-layer discrepancy from generic increases in visual attention mass.
  Authors: The current manuscript reports consistent improvements across five LVLMs and multiple benchmarks, but we agree that the experimental section would benefit from more detailed quantitative breakdowns and ablations. We will expand the revised version to include ablation studies on the number of early tokens used, layer selection choices, and additional baselines that apply generic visual attention boosts without leveraging inter-layer discrepancy. These additions will help isolate the specific contribution of our approach.
  Revision: yes
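The controls requested by the referee can be made concrete by building maps that carry the same total boost as the ILVAD saliency map but ignore where the evidence is. The sketch below is one possible construction under the assumption of a binary saliency map; `control_maps` and its `seed` parameter are hypothetical names, not part of the paper or its code release.

```python
import numpy as np

def control_maps(saliency, seed=0):
    """Mass-matched control maps for a binary saliency map (illustrative).

    Returns (random_map, uniform_map):
      random_map  selects the same number of visual tokens, chosen at random;
      uniform_map spreads the same total mass evenly over all visual tokens.
    """
    s = np.asarray(saliency, dtype=float)
    n = s.size
    k = int(np.count_nonzero(s))
    rng = np.random.default_rng(seed)
    random_map = np.zeros(n)
    # Pick k distinct visual-token positions uniformly at random.
    random_map[rng.choice(n, size=k, replace=False)] = 1.0
    # Same total mass as the original map, distributed evenly.
    uniform_map = np.full(n, s.sum() / n)
    return random_map, uniform_map
```

If boosting with either control map reduces hallucinations as much as the ILVAD map does, the gain would be attributable to added visual attention mass rather than to inter-layer discrepancy.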
Circularity Check
No significant circularity; method is an explicit empirical procedure on observed attention patterns
The paper's central contribution is an empirical observation that LVLMs exhibit inter-layer attention discrepancies to visual tokens, followed by a training-free procedural definition of ILVAD: extract attention weights from the first few generated tokens to image tokens across layers, select repeatedly activated tokens to form a saliency map, then use the map to boost visual attention and re-weight text tokens during continued generation. This procedure is defined directly in terms of the model's internal attention weights rather than any fitted parameter, self-referential equation, or prior result that reduces to the target claim by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the derivation; the assumption that repeatedly activated tokens constitute 'correct visual evidence' is presented as a motivated heuristic whose effectiveness is then tested on external benchmarks, keeping the chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Attention weights from early generated tokens to visual tokens reflect the model's sensitivity to correct visual evidence.