Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
Pith reviewed 2026-05-19 12:44 UTC · model grok-4.3
The pith
Large vision-language models reduce hallucinations by calibrating attention using their own confidence scores to counter spatial and modality biases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Confidence-Aware Attention Calibration (CAAC) framework mitigates hallucination by addressing two biases. Visual-Token Calibration (VTC) counters spatial perception bias by distributing attention more evenly across visual tokens. Adaptive Attention Re-Scaling (AAR) counters modality bias by strengthening visual grounding in proportion to the model's confidence, which keeps visual alignment consistent as generation proceeds.
What carries the argument
The Confidence-Aware Attention Calibration framework, which performs Visual-Token Calibration to equalize attention across image tokens followed by Adaptive Attention Re-Scaling that modulates attention weights according to the model's internal confidence.
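To make the two steps concrete, here is a minimal Python sketch on a single attention row. It is an illustration under assumptions, not the paper's implementation: this summary gives no formula, so the function names, the blending strength `alpha`, the gain `1 + beta * confidence`, and the final renormalisation are all hypothetical. In particular, letting the gain rise with confidence follows the "in proportion to" wording above and may differ from the paper's actual rule.

```python
def calibrate_visual_tokens(attn, visual_idx, alpha=0.5):
    """Hypothetical VTC step: blend each visual weight toward the visual mean.

    The total attention mass on visual tokens is left unchanged.
    """
    out = list(attn)
    mean_v = sum(attn[i] for i in visual_idx) / len(visual_idx)
    for i in visual_idx:
        out[i] = (1 - alpha) * attn[i] + alpha * mean_v
    return out


def rescale_visual_attention(attn, visual_idx, confidence, beta=1.0):
    """Hypothetical AAR step: scale visual weights by a confidence-dependent gain.

    The gain form (1 + beta * confidence) is an assumption of this sketch.
    The row is renormalised so that it sums to 1 again.
    """
    gain = 1.0 + beta * confidence
    vis = set(visual_idx)
    scaled = [w * gain if i in vis else w for i, w in enumerate(attn)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy row: positions 0-2 are image tokens, positions 3-4 are text tokens.
attn = [0.40, 0.05, 0.05, 0.30, 0.20]
visual_idx = [0, 1, 2]

step1 = calibrate_visual_tokens(attn, visual_idx, alpha=1.0)
step2 = rescale_visual_attention(step1, visual_idx, confidence=0.5, beta=1.0)
visual_share = sum(step2[i] for i in visual_idx)
```

With `alpha=1.0` the visual weights become uniform while their combined mass stays the same; the re-scaling step then moves mass from text tokens to visual tokens.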
If this is right
- The calibrated models produce fewer hallucinations than prior training-free methods, with the largest gains appearing in extended generation sequences.
- Attention remains more steadily focused on visual content rather than drifting toward text as output length grows.
- Visual grounding improves while preserving the original model's accuracy on tasks that require image understanding.
- The adjustments operate at inference time and require no additional training data or parameter updates.
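The drift described in the second bullet can be tracked with a simple diagnostic: the fraction of attention mass that lands on visual tokens at each generation step. The sketch below is hypothetical (the function name, input layout, and toy numbers are assumptions, not values from the paper). It assumes one attention row per generated token, with the image tokens at fixed positions.

```python
def visual_attention_share(attn_rows, visual_idx):
    """Return the share of attention mass on visual tokens for each step.

    attn_rows: one list of attention weights per generated token; rows may
    grow in length as more text tokens are appended to the context.
    """
    vis = set(visual_idx)
    shares = []
    for row in attn_rows:
        total = sum(row)
        visual_mass = sum(w for i, w in enumerate(row) if i in vis)
        shares.append(visual_mass / total)
    return shares


# Toy rows (made-up numbers): positions 0-1 are image tokens.
rows = [
    [0.3, 0.3, 0.4],
    [0.2, 0.2, 0.3, 0.3],
    [0.1, 0.1, 0.3, 0.2, 0.3],
]
shares = visual_attention_share(rows, visual_idx=[0, 1])
drift = shares[-1] - shares[0]  # negative means attention moved toward text
```

A calibrated model that behaves as claimed should show a flatter `shares` curve than the uncalibrated baseline on the same prompts.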
Where Pith is reading between the lines
- The same confidence-driven re-scaling idea could be examined for reducing factual drift in text-only language models during long responses.
- Deploying the calibration in practical image-captioning systems would test whether users notice fewer invented details in real applications.
- Combining the inference-time adjustment with targeted fine-tuning might yield further gains in overall multimodal reliability.
Load-bearing premise
The approach rests on the premise that spatial perception bias and modality bias are the main sources of hallucination and that the model's confidence score supplies a trustworthy signal for attention adjustment without creating new inconsistencies.
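One way to probe the confidence part of this premise is to compare confidence on hallucinated versus grounded tokens and check whether the gap could be due to chance. The sketch below uses a two-sided permutation test from the Python standard library. The test choice, the function name, and the confidence values are assumptions for illustration; the values are placeholders, not measurements from the paper.

```python
import random
from statistics import mean


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# PLACEHOLDER confidences for annotated tokens (not data from the paper).
hallucinated = [0.41, 0.35, 0.52, 0.47, 0.38]
grounded = [0.81, 0.77, 0.90, 0.69, 0.85, 0.74]

observed, p_value = permutation_test(hallucinated, grounded)
```

A small `p_value` would only show that confidence differs between the two groups. It would not by itself show that re-scaling on that signal avoids amplifying an early error.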
What would settle it
Running the calibration on long-form image description tasks and finding that the rate of invented objects or attributes stays the same or rises relative to the baseline model would show the method does not deliver the claimed reduction.
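The rate of invented objects can be scored in the style of CHAIR: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The sketch below is a simplified version under assumptions: object mentions are taken as already extracted, and the synonym matching used by the real benchmark is omitted. The function name and toy captions are hypothetical.

```python
def chair_scores(samples):
    """Compute simplified CHAIR-style scores.

    samples: list of (mentioned_objects, ground_truth_objects) pairs.
    Returns (instance_level_rate, caption_level_rate).
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in samples:
        truth = set(truth)
        invented = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(invented)
        captions_with_hallucination += 1 if invented else 0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(samples) if samples else 0.0
    return chair_i, chair_s


# Toy example (made up): "bench" is not in the first image.
samples = [
    (["dog", "frisbee", "bench"], {"dog", "frisbee"}),
    (["cat", "sofa"], {"cat", "sofa"}),
]
chair_i, chair_s = chair_scores(samples)
```

Running the same scoring on baseline and calibrated outputs for identical long-form prompts gives the comparison described above: if the calibrated scores are equal or higher, the claimed reduction does not hold.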
Original abstract
Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, confidently describing objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Confidence-Aware Attention Calibration (CAAC) framework to reduce hallucinations in large vision-language models. It identifies spatial perception bias and modality bias as primary causes and introduces a two-step intervention: Visual-Token Calibration (VTC) to balance attention across image tokens, followed by Adaptive Attention Re-Scaling (AAR) that uses the model's internal confidence score to reinforce visual grounding during generation. Experiments on the CHAIR, AMBER, and POPE benchmarks are reported to show that CAAC outperforms existing baselines, with stronger gains in long-form generation.
Significance. If the empirical claims hold, the work would provide a practical training-free method for improving reliability of LVLMs in open-ended settings, where hallucinations remain a central obstacle to deployment. The emphasis on long-form scenarios addresses a gap left by prior interventions that degrade under extended generation.
major comments (2)
- [Method, Adaptive Attention Re-Scaling] The AAR component (described in the method) treats the model's internal confidence score as a reliable indicator for when to re-scale attention toward visual tokens. No direct measurement or ablation is supplied that quantifies the correlation between high-confidence tokens and actual hallucination events, particularly once an early error has occurred in long-form output. This assumption is load-bearing for the central claim that AAR mitigates rather than potentially amplifies errors.
- [Experiments] The experimental section asserts outperformance on CHAIR, AMBER, and POPE yet supplies no numerical deltas, standard deviations, or ablation tables that isolate the contribution of VTC versus AAR. Without these data the strength of the benchmark claims cannot be verified against the stated improvements.
minor comments (1)
- [Method] Notation for the confidence threshold and the precise re-scaling formula in AAR should be stated explicitly with an equation number for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each major comment in detail below and have made revisions to the manuscript to incorporate the feedback where appropriate.
- Referee: [Method, Adaptive Attention Re-Scaling] The AAR component (described in the method) treats the model's internal confidence score as a reliable indicator for when to re-scale attention toward visual tokens. No direct measurement or ablation is supplied that quantifies the correlation between high-confidence tokens and actual hallucination events, particularly once an early error has occurred in long-form output. This assumption is load-bearing for the central claim that AAR mitigates rather than potentially amplifies errors.
Authors: We agree that validating the correlation between the model's confidence scores and hallucination occurrences is important for justifying the AAR mechanism. In the original manuscript, we motivated this choice based on the observed decrease in visual attention over generation steps and the role of confidence in detecting potential errors. To directly address this, we have added a new subsection in the experiments that analyzes the relationship between token confidence and hallucinated content. Specifically, we manually annotated a set of long-form generations for hallucinations and computed the average confidence for hallucinated vs. non-hallucinated tokens, showing a statistically significant difference. We also include an ablation study comparing AAR with a variant that uses fixed re-scaling without confidence guidance, demonstrating that the adaptive, confidence-based approach yields better performance and does not amplify errors. These additions are included in the revised manuscript.
revision: yes
- Referee: [Experiments] The experimental section asserts outperformance on CHAIR, AMBER, and POPE yet supplies no numerical deltas, standard deviations, or ablation tables that isolate the contribution of VTC versus AAR. Without these data the strength of the benchmark claims cannot be verified against the stated improvements.
Authors: We appreciate this observation regarding the presentation of results. The full paper does include comparative tables on the three benchmarks, but we acknowledge that explicit numerical deltas, standard deviations across runs, and a clear ablation isolating VTC and AAR were not sufficiently highlighted. In the revised version, we have expanded the experimental section with a new table that reports mean performance with standard deviations from 3 independent runs. We also provide delta improvements over the strongest baseline for each metric. Furthermore, we added an ablation study table that shows the incremental contributions: baseline, +VTC, +AAR, and full CAAC. This allows verification of the individual and combined effects, particularly highlighting the gains in long-form generation scenarios.
revision: yes
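The reporting the referee asks for (mean and standard deviation over runs, plus the delta against the strongest baseline) is straightforward to compute. The sketch below shows one way to do it with the Python standard library. The helper names and every number are placeholders chosen for illustration; none of them come from the paper, and the metric is assumed to be one where lower is better, as with a hallucination rate.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(runs), stdev(runs)


def delta_vs_best_baseline(method_runs, baselines, lower_is_better=True):
    """Return (name of strongest baseline, method mean minus its mean)."""
    pick = min if lower_is_better else max
    best = pick(baselines, key=lambda name: mean(baselines[name]))
    return best, mean(method_runs) - mean(baselines[best])


# PLACEHOLDER scores from 3 runs each (not results from the paper).
baselines = {
    "baseline_a": [50.0, 52.0, 51.0],
    "baseline_b": [46.0, 47.0, 45.0],
}
method_runs = [40.0, 41.0, 39.0]

method_mean, method_std = summarize(method_runs)
best_name, delta = delta_vs_best_baseline(method_runs, baselines)
```

The same `summarize` call can be applied to each ablation row (baseline, +VTC, +AAR, full CAAC) to separate the contribution of each component.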
Circularity Check
No significant circularity in derivation chain
The paper introduces the CAAC framework as a new two-step intervention (VTC for balancing visual tokens and AAR for confidence-guided re-scaling) to target spatial perception and modality biases. No equations, derivations, or fitted parameters are described that reduce the claimed hallucination reductions or benchmark improvements to quantities already defined or fitted inside the paper itself. Validation relies on external experiments on CHAIR, AMBER, and POPE rather than any self-referential construction or prediction-by-fit. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided description, leaving the central approach independent and self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Spatial perception bias and modality bias are the key causes of hallucination that can be mitigated by attention calibration.
Forward citations
Cited by 1 Pith paper
- Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval: SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.