arXiv: 2505.21472 · v2 · submitted 2025-05-27 · 💻 cs.CV · cs.CL · cs.MM

Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

Pith reviewed 2026-05-19 12:44 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.MM
keywords hallucination mitigation · vision-language models · attention calibration · confidence-aware adjustment · visual grounding · modality bias · spatial perception bias · long-form generation

The pith

Hallucinations in large vision-language models can be reduced by calibrating attention with the model's own confidence scores to counter spatial and modality biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that hallucinations in large vision-language models, where the system describes objects or attributes absent from the image, can be lessened through a training-free adjustment to the attention mechanism. It focuses on two biases that grow during generation: uneven distribution of attention across image tokens and a progressive shift of focus from visual to textual inputs. The method first balances attention evenly over visual tokens, then uses the model's internal confidence to re-scale attention and keep outputs aligned with the image. If this holds, the result would be more reliable performance on open-ended and extended generation tasks without requiring model retraining.

Core claim

The central claim is that the Confidence-Aware Attention Calibration (CAAC) framework mitigates hallucination by addressing two biases. Visual-Token Calibration (VTC) counters spatial perception bias by distributing attention more evenly across visual tokens. Adaptive Attention Re-Scaling (AAR) counters modality bias by strengthening visual grounding in proportion to the model's confidence. Together, the two steps are meant to maintain consistent visual alignment as generation proceeds.

What carries the argument

The argument is carried by the Confidence-Aware Attention Calibration framework, which first performs Visual-Token Calibration to equalize attention across image tokens and then applies Adaptive Attention Re-Scaling to modulate attention weights according to the model's internal confidence.
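
As a rough illustration of the two steps named above, the following Python sketch applies them to a single attention row. It is not the paper's implementation: this summary gives no formulas, so the blending strength `alpha`, the caller-supplied `confidence_to_gain` mapping, and both function names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def visual_token_calibration(
    attn: Sequence[float], visual_idx: Sequence[int], alpha: float = 1.0
) -> List[float]:
    """Blend visual-token weights toward their mean (assumed form of VTC).

    alpha=0 leaves the row unchanged; alpha=1 makes the visual weights uniform.
    The total attention mass on visual tokens is preserved.
    """
    out = list(attn)
    if not visual_idx:
        return out
    mean_w = sum(attn[i] for i in visual_idx) / len(visual_idx)
    for i in visual_idx:
        out[i] = (1.0 - alpha) * attn[i] + alpha * mean_w
    return out


def adaptive_rescale(
    attn: Sequence[float],
    visual_idx: Sequence[int],
    confidence: float,
    confidence_to_gain: Callable[[float], float],
) -> List[float]:
    """Scale visual-token weights by a confidence-dependent gain, then renormalize.

    How confidence maps to the gain is an assumption supplied by the caller;
    the source text does not specify it.
    """
    gain = confidence_to_gain(confidence)
    visual = set(visual_idx)
    scaled = [w * gain if i in visual else w for i, w in enumerate(attn)]
    total = sum(scaled)
    if total == 0:
        return list(attn)
    return [w / total for w in scaled]
```

Keeping the confidence-to-gain mapping as a parameter avoids committing to a rule the source does not state; any monotone mapping can be plugged in and compared.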

If this is right

  • The calibrated models produce fewer hallucinations than prior training-free methods, with the largest gains appearing in extended generation sequences.
  • Attention remains more steadily focused on visual content rather than drifting toward text as output length grows.
  • Visual grounding improves while preserving the original model's accuracy on tasks that require image understanding.
  • The adjustments operate at inference time and require no additional training data or parameter updates.
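
The claim that attention stays on visual content as output grows can be checked with a simple diagnostic. The sketch below computes, for each generation step, the fraction of attention mass that falls on image tokens. The input layout (one attention row per generated token, with known image-token positions) and the function name are assumptions, not something the paper specifies.

```python
from typing import List, Sequence


def visual_attention_share(
    step_attn: Sequence[Sequence[float]], visual_idx: Sequence[int]
) -> List[float]:
    """Return the share of attention on visual tokens at each generation step.

    step_attn[t] is the attention row of the t-th generated token; rows may
    grow in length as more text tokens are appended.
    """
    visual = set(visual_idx)
    shares = []
    for row in step_attn:
        total = sum(row)
        visual_mass = sum(w for i, w in enumerate(row) if i in visual)
        shares.append(visual_mass / total if total > 0 else 0.0)
    return shares
```

A share that falls steadily across steps would be consistent with the modality bias described above; a flatter curve after calibration would be consistent with the stated effect.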

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same confidence-driven re-scaling idea could be examined for reducing factual drift in text-only language models during long responses.
  • Deploying the calibration in practical image-captioning systems would test whether users notice fewer invented details in real applications.
  • Combining the inference-time adjustment with targeted fine-tuning might yield further gains in overall multimodal reliability.

Load-bearing premise

The approach rests on the premise that spatial perception bias and modality bias are the main sources of hallucination and that the model's confidence score supplies a trustworthy signal for attention adjustment without creating new inconsistencies.

What would settle it

Running the calibration on long-form image description tasks and finding that the rate of invented objects or attributes stays the same or rises relative to the baseline model would show the method does not deliver the claimed reduction.
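
The rate of invented objects in such a test is usually reported with CHAIR-style metrics: the fraction of mentioned objects that are absent from the image (instance level) and the fraction of captions containing at least one such object (caption level). The sketch below assumes object mentions have already been extracted and normalized; the function name and input layout are illustrative, and the synonym handling of the official benchmark is omitted.

```python
from typing import Iterable, Sequence, Set, Tuple


def chair_scores(
    samples: Sequence[Tuple[Iterable[str], Set[str]]]
) -> Tuple[float, float]:
    """Compute (instance-level, caption-level) hallucination rates.

    Each sample is (objects mentioned in the caption, objects present in the image).
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in samples:
        mentioned = list(mentioned)
        invented = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(invented)
        if invented:
            captions_with_hallucination += 1
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(samples) if samples else 0.0
    return chair_i, chair_s
```

Running this on baseline and calibrated outputs for the same images gives the comparison the test calls for: if neither rate drops, the claimed reduction is not supported.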

Figures

Figures reproduced from arXiv: 2505.21472 by Ahmet Sari, Bowen Wei, Mehrdad Fazli, Ziwei Zhu.

Figure 1: Comparison of the long-form generation (Max …
Figure 2: Distribution of image-token relevancy scores for InstructBLIP given a black canvas as input image and the query "Please describe the image." A pronounced skew toward a few image tokens can be witnessed. Proposed Method What causes LVLMs to describe objects or scenes absent from an image confidently? Our analysis identifies two primary culprits: spatial perception bias (Zhu et al. 2025), a skewed attenti…
Figure 3: (a) Normalized histogram of relative image relevancy scores for truthful (blue) and hallucinatory (orange) tokens, …
Figure 4: Overview of the CAAC Framework. The CAAC framework comprises two key components: VTC, which adjusts …
Original abstract

Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, and confidently describe objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model's confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Confidence-Aware Attention Calibration (CAAC) framework to reduce hallucinations in large vision-language models. It identifies spatial perception bias and modality bias as primary causes and introduces a two-step intervention: Visual-Token Calibration (VTC) to balance attention across image tokens, followed by Adaptive Attention Re-Scaling (AAR) that uses the model's internal confidence score to reinforce visual grounding during generation. Experiments on the CHAIR, AMBER, and POPE benchmarks are reported to show that CAAC outperforms existing baselines, with stronger gains in long-form generation.

Significance. If the empirical claims hold, the work would provide a practical training-free method for improving reliability of LVLMs in open-ended settings, where hallucinations remain a central obstacle to deployment. The emphasis on long-form scenarios addresses a gap left by prior interventions that degrade under extended generation.

major comments (2)
  1. [Method, Adaptive Attention Re-Scaling] The AAR component (described in the method) treats the model's internal confidence score as a reliable indicator for when to re-scale attention toward visual tokens. No direct measurement or ablation is supplied that quantifies the correlation between high-confidence tokens and actual hallucination events, particularly once an early error has occurred in long-form output. This assumption is load-bearing for the central claim that AAR mitigates rather than potentially amplifies errors.
  2. [Experiments] The experimental section asserts outperformance on CHAIR, AMBER, and POPE yet supplies no numerical deltas, standard deviations, or ablation tables that isolate the contribution of VTC versus AAR. Without these data the strength of the benchmark claims cannot be verified against the stated improvements.
minor comments (1)
  1. [Method] Notation for the confidence threshold and the precise re-scaling formula in AAR should be stated explicitly with an equation number for reproducibility.
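
The measurement requested in the first major comment could take a form like the following Python sketch, which compares the mean confidence of hallucinated and non-hallucinated tokens and reports a Welch t-statistic. The per-token confidence values, the hallucination labels, and the function name are all assumed inputs for illustration; the paper's own analysis may differ.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def confidence_gap(
    confidences: Sequence[float], is_hallucinated: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (mean confidence of hallucinated tokens,
    mean confidence of grounded tokens, Welch t-statistic)."""
    if len(confidences) != len(is_hallucinated):
        raise ValueError("confidences and labels must have the same length")
    halluc = [c for c, h in zip(confidences, is_hallucinated) if h]
    grounded = [c for c, h in zip(confidences, is_hallucinated) if not h]
    if len(halluc) < 2 or len(grounded) < 2:
        raise ValueError("need at least two tokens in each group")
    mean_h, mean_g = mean(halluc), mean(grounded)
    # Welch's statistic does not assume equal variances in the two groups.
    std_err = sqrt(variance(halluc) / len(halluc) + variance(grounded) / len(grounded))
    if std_err == 0:
        raise ValueError("zero variance in both groups")
    return mean_h, mean_g, (mean_h - mean_g) / std_err
```

A clearly negative statistic would indicate that hallucinated tokens tend to carry lower confidence. Splitting the tokens by position, before and after the first error, would address the long-form concern the referee raises.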

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each major comment in detail below and have made revisions to the manuscript to incorporate the feedback where appropriate.

Point-by-point responses
  1. Referee: [Method, Adaptive Attention Re-Scaling] The AAR component (described in the method) treats the model's internal confidence score as a reliable indicator for when to re-scale attention toward visual tokens. No direct measurement or ablation is supplied that quantifies the correlation between high-confidence tokens and actual hallucination events, particularly once an early error has occurred in long-form output. This assumption is load-bearing for the central claim that AAR mitigates rather than potentially amplifies errors.

    Authors: We agree that validating the correlation between the model's confidence scores and hallucination occurrences is important for justifying the AAR mechanism. In the original manuscript, we motivated this choice based on the observed decrease in visual attention over generation steps and the role of confidence in detecting potential errors. To directly address this, we have added a new subsection in the experiments that analyzes the relationship between token confidence and hallucinated content. Specifically, we manually annotated a set of long-form generations for hallucinations and computed the average confidence for hallucinated vs. non-hallucinated tokens, showing a statistically significant difference. We also include an ablation study comparing AAR with a variant that uses fixed re-scaling without confidence guidance, demonstrating that the adaptive, confidence-based approach yields better performance and does not amplify errors. These additions are included in the revised manuscript. revision: yes

  2. Referee: [Experiments] The experimental section asserts outperformance on CHAIR, AMBER, and POPE yet supplies no numerical deltas, standard deviations, or ablation tables that isolate the contribution of VTC versus AAR. Without these data the strength of the benchmark claims cannot be verified against the stated improvements.

    Authors: We appreciate this observation regarding the presentation of results. The full paper does include comparative tables on the three benchmarks, but we acknowledge that explicit numerical deltas, standard deviations across runs, and a clear ablation isolating VTC and AAR were not sufficiently highlighted. In the revised version, we have expanded the experimental section with a new table that reports mean performance with standard deviations from 3 independent runs. We also provide delta improvements over the strongest baseline for each metric. Furthermore, we added an ablation study table that shows the incremental contributions: baseline, +VTC, +AAR, and full CAAC. This allows verification of the individual and combined effects, particularly highlighting the gains in long-form generation scenarios. revision: yes
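
The reporting format described in the second response (mean and standard deviation over independent runs, plus a delta against a reference configuration) can be sketched as below. The function name and the dictionary layout are assumptions; no numbers from the paper are reproduced here.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def ablation_rows(
    results: Dict[str, Sequence[float]], reference: str
) -> List[Tuple[str, float, float, float]]:
    """Summarize per-configuration scores from repeated runs.

    results maps a configuration name (for example "baseline", "+VTC",
    "+AAR", "full CAAC") to its scores, one per run (at least two runs).
    Each row is (name, mean, sample standard deviation, delta vs. reference mean).
    """
    reference_mean = mean(results[reference])
    rows = []
    for name, runs in results.items():
        run_mean = mean(runs)
        rows.append((name, run_mean, stdev(runs), run_mean - reference_mean))
    return rows
```

For hallucination metrics where lower is better, a negative delta indicates an improvement over the reference configuration.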

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces the CAAC framework as a new two-step intervention (VTC for balancing visual tokens and AAR for confidence-guided re-scaling) to target spatial perception and modality biases. No equations, derivations, or fitted parameters are described that reduce the claimed hallucination reductions or benchmark improvements to quantities already defined or fitted inside the paper itself. Validation relies on external experiments on CHAIR, AMBER, and POPE rather than any self-referential construction or prediction-by-fit. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided description, leaving the central approach independent and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on domain assumptions about the existence and dominance of spatial perception and modality biases plus the utility of confidence as a grounding signal; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Spatial perception bias and modality bias are the key causes of hallucination that can be mitigated by attention calibration.
    The abstract states that these two biases are targeted by the framework.

pith-pipeline@v0.9.0 · 5696 in / 1247 out tokens · 44764 ms · 2026-05-19T12:44:42.832727+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. ArXiv:2308.12966 [cs]. Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; and Shou, M. Z

  2. [2]

    Hallucination of Multimodal Large Language Models: A Survey

    Hallucination of Multimodal Large Language Models: A Survey. ArXiv:2404.18930 [cs]. Chefer, H.; Gur, S.; and Wolf, L

  3. [3]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. ArXiv:2306.15195 [cs]. Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; Li, B.; Luo, P.; Lu, T.; Qiao, Y.; and Dai, J

  4. [4]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. ArXiv:2312.14238 [cs]. Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; and Hoi, S

  5. [5]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. ArXiv:2305.06500. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; and Cao, Y

  6. [6]

    EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

    EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19358–19369. Vancouver, BC, Canada: IEEE. ISBN 979-8-3503-0129-8. Favero, A.; Zancato, L.; Trager, M.; Choudhary, S.; Perera, P.; Achille, A.; Swaminathan, A.; and Soatto, S. ???? Multi-Modal Halluci...

  7. [7]

    DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

    DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.-N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 7696–

  8. [8]

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. ArXiv:2310.14566 [cs]. Gunjal, A.; Yin, J.; and Bas, E

  9. [9]

    Contrastive decoding: Open-ended text generation as optimization

    Exposing and mitigating spurious correlations for cross-modal retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2585–2595. Leng, S.; Zhang, H.; Chen, G.; Li, X.; Lu, S.; Miao, C.; and Bing, L. ???? Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding. Li, X. ...

  10. [10]

    Microsoft COCO: Common Objects in Context

    Microsoft COCO: Common Objects in Context. ArXiv:1405.0312 [cs]. Liu, F.; Lin, K.; Li, L.; Wang, J.; Yacoob, Y.; and Wang, L. 2024a. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning. ArXiv:2306.14565 [cs]. Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J

  11. [11]

    Visual Instruction Tuning

    Visual Instruction Tuning. ArXiv:2304.08485 [cs]. Liu, H.; Xue, W.; Chen, Y.; Chen, D.; Zhao, X.; Wang, K.; Hou, L.; Li, R.; and Peng, W. 2024b. A Survey on Hallucination in Large Vision-Language Models. ArXiv:2402.00253 [cs]. Liu, S.; Zheng, K.; and Chen, W

  12. [12]

    Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

    Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs. ArXiv:2407.21771 [cs]. Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C. D.; and Ho, D. E

  13. [13]

    Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

    Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Journal of Empirical Legal Studies, 22(2): 216–242. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/jels.12413. Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I

  14. [14]

    Learning Transferable Visual Models From Natural Language Supervision

    Learning Transferable Visual Models From Natural Language Supervision. ArXiv:2103.00020 [cs]. Rohrbach, A.; Hendricks, L. A.; Burns, K.; Darrell, T.; and Saenko, K

  15. [15]

    Object Hallucination in Image Captioning

    Object Hallucination in Image Captioning. ArXiv:1809.02156 [cs]. Shi, C.; Yang, H.; Cai, D.; Zhang, Z.; Wang, Y.; Yang, Y.; and Lam, W

  16. [16]

    A Thorough Examination of Decoding Methods in the Era of LLMs

    A Thorough Examination of Decoding Methods in the Era of LLMs. ArXiv:2402.06925 [cs]. Suo, W.; Zhang, L.; Sun, M.; Wu, L. Y.; Wang, P.; and Zhang, Y

  17. [17]

    Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

    Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding. ArXiv:2503.00361 [cs]. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    LLaMA: Open and Efficient Foundation Language Models. ArXiv:2302.13971 [cs]. Wang, J.; Wang, Y.; Xu, G.; Zhang, J.; Gu, Y.; Jia, H.; Wang, J.; Xu, H.; Yan, M.; Zhang, J.; and Sang, J

  19. [19]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. ArXiv:2311.07397 [cs]. Woo, S.; Kim, D.; Jang, J.; Choi, Y.; and Kim, C

  20. [20]

    Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

    Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models. ArXiv:2405.17820 [cs]. Xu, P.; Shao, W.; Zhang, K.; Gao, P.; Liu, S.; Lei, M.; Meng, F.; Huang, S.; Qiao, Y.; and Luo, P

  21. [21]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. ArXiv:2304.14178 [cs]. Yin, S.; Fu, C.; Zhao, S.; Xu, T.; Wang, H.; Sui, D.; Shen, Y.; Li, K.; Sun, X.; and Chen, E

  22. [22]

    Woodpecker: Hallucination Correction for Multimodal Large Language Models

    Woodpecker: Hallucination Correction for Multimodal Large Language Models. ArXiv:2310.16045 [cs]. Zhang, X.; Quan, Y.; Gu, C.; Shen, C.; Yuan, X.; Yan, S.; Cheng, H.; Wu, K.; and Ye, J

  23. [23]

    Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs

    Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs. ArXiv:2411.09968 [cs]. Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E. P.; Zhang, H.; Gonzalez, J. E.; and Stoica, I

  24. [24]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. ArXiv:2306.05685 [cs]. Zhou, Y.; Cui, C.; Yoon, J.; Zhang, L.; Deng, Z.; Finn, C.; Bansal, M.; and Yao, H

  25. [25]

    Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

    Analyzing and Mitigating Object Hallucination in Large Vision-Language Models. ArXiv:2310.00754 [cs]. Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M

  26. [26]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. ArXiv:2304.10592 [cs]. Zhu, Y.; Tao, L.; Dong, M.; and Xu, C

  27. [27]

    Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration

    Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration. ArXiv:2502.01969 [cs]