arxiv: 2605.23344 · v1 · pith:FN5EBS45new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: contrastive decoding · hallucination mitigation · large vision-language models · inference-time calibration · object hallucination · uncertainty gating · localized perturbation · attention-guided decoding

The pith

CHASD activates contrastive decoding only on low-confidence tokens using localized visual perturbations to reduce hallucinations in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that hallucination risks in large vision-language models occur only at specific tokens where visual evidence is weak, so a confidence gate can safely limit contrastive calibration to those steps. It constructs the negative branch by perturbing only the visual tokens currently attended to, rather than applying global changes or running the branch at every token. This selective approach is meant to cut object hallucinations while avoiding extra computation on high-confidence steps that follow the original distribution. A reader would care if the method delivers better benchmark scores than prior training-free contrastive techniques without slowing inference much.

Core claim

CHASD is an inference-time framework that uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next token is below a threshold, and builds the negative branch through attention-guided localized perturbations of the currently salient visual tokens, thereby reducing unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps.
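
As a rough illustration of the gating logic described above, here is a minimal sketch in plain Python. The bypass rule (skip the negative branch when the maximum probability exceeds τ) follows the description in this summary. The contrastive combination `(1 + alpha) * original - alpha * negative` is an assumption borrowed from visual contrastive decoding [18], since the exact formula is not given here, and the names `gated_contrastive_step`, `negative_logits_fn`, `tau`, and `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_step(orig_logits, negative_logits_fn, tau=0.9, alpha=1.0):
    """Pick the next token for one decoding step.

    Returns (token_id, used_negative_branch).
    negative_logits_fn is a hypothetical callable that runs the perturbed
    (negative) forward pass; it is only invoked on low-confidence steps.
    """
    probs = softmax(orig_logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] > tau:
        # Confident step: keep the original distribution, skip the extra pass.
        return best, False

    # Uncertain step: pay for the negative branch and contrast the logits.
    # Assumed VCD-style combination, not confirmed as the paper's formula.
    neg_logits = negative_logits_fn()
    contrast = [(1 + alpha) * o - alpha * n
                for o, n in zip(orig_logits, neg_logits)]
    calibrated = softmax(contrast)
    token = max(range(len(calibrated)), key=lambda i: calibrated[i])
    return token, True
```

Passing the negative branch as a callable makes the efficiency argument concrete: on confident steps the function returns before the second forward pass is ever requested.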

What carries the argument

Uncertainty-driven confidence gate that triggers attention-guided localized perturbations of salient visual tokens for the negative branch only on uncertain decoding steps.
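
The localized perturbation can be pictured with the following sketch, which is only an assumption about the mechanics: the review itself notes that the exact masking procedure and the number of perturbed tokens are not specified. Here the `k` most-attended visual tokens are replaced by a constant vector; `perturb_salient_tokens`, `k`, and `fill` are hypothetical names and choices.

```python
def perturb_salient_tokens(visual_tokens, attention, k=2, fill=0.0):
    """Return (perturbed_tokens, salient_indices).

    visual_tokens: list of feature vectors (lists of floats), one per visual token.
    attention:     list of attention weights, one per visual token, for the
                   current decoding step.
    The k most-attended tokens are overwritten with a constant vector; all
    other tokens are copied unchanged. Zero-filling is an illustrative choice.
    """
    if len(visual_tokens) != len(attention):
        raise ValueError("need exactly one attention weight per visual token")
    k = max(0, min(k, len(attention)))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    salient = set(ranked[:k])
    perturbed = [
        [fill] * len(tok) if i in salient else list(tok)
        for i, tok in enumerate(visual_tokens)
    ]
    return perturbed, sorted(salient)


tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
weights = [0.10, 0.50, 0.05, 0.35]
negative_input, touched = perturb_salient_tokens(tokens, weights, k=2)
```

Only the attended tokens change, which is the contrast with global perturbations: visual evidence the model is not currently using stays intact in the negative branch.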

If this is right

  • Hallucination-related metrics improve over strong training-free baselines on POPE, AMBER, MME, MMHal-Bench, and CHAIR.
  • Inference efficiency stays competitive because the negative branch runs only on low-confidence steps.
  • High-confidence steps keep the original model distribution without perturbation.
  • Localized perturbations avoid altering useful visual evidence that global methods might change.
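
The efficiency point in the list above can be made concrete with a back-of-the-envelope count of forward passes. This is a simplified sketch under stated assumptions: each decoding step costs one original forward pass, each gated step costs one additional negative-branch pass, and overheads such as attention extraction or cache handling are ignored. The function name and inputs are hypothetical.

```python
def forward_pass_counts(max_probs, tau):
    """Count forward passes for a sequence of per-step maximum probabilities.

    A step triggers the negative branch when its max probability does not
    exceed tau (the bypass condition is max_prob > tau).
    """
    steps = len(max_probs)
    gated_steps = sum(1 for p in max_probs if p <= tau)
    return {
        "base": steps,                  # plain decoding, no contrastive branch
        "gated": steps + gated_steps,   # negative branch only on uncertain steps
        "always_on": 2 * steps,         # negative branch at every step
    }


counts = forward_pass_counts([0.99, 0.40, 0.95, 0.60], tau=0.9)
```

Under this simple model the saving relative to always-on contrastive decoding equals the number of bypassed steps, so the benefit depends entirely on how many tokens clear the threshold.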

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gating idea could be tested on other error types such as factual inconsistencies in text-only models.
  • Varying the confidence threshold per model size might yield further efficiency gains.
  • If attention shifts reliably mark hallucination-prone steps, the method might extend to video or audio-language models.

Load-bearing premise

Hallucination risks are transient and limited to specific low-confidence tokens, so skipping contrastive calibration on high-confidence steps does not miss critical hallucinations.

What would settle it

A benchmark run in which CHASD shows no improvement or a drop in hallucination metrics on POPE or AMBER compared to always-on contrastive baselines, indicating that high-confidence steps still produce uncorrected hallucinations.

Figures

Figures reproduced from arXiv: 2605.23344 by Kejia Zhang, Xiaoyi Huang, Zhiming Luo.

Figure 1: Visualization of step-wise probabilities and cross-attention maps during the LVLM generation process. Top: The model assigns varying confidence scores to each token. The high confidence observed in functional tokens suggests that the contrastive decoding branch can be selectively bypassed to enhance inference efficiency. Bottom: The model’s visual focus areas shift dynamically when generating different tok…
Figure 2: Overview of the CHASD at time step t. For the initial token generated at the current step, it first undergoes (I) Uncertainty-driven Confidence Gating: if the maximum predictive probability P exceeds the threshold τ, the system directly outputs the candidate token. Otherwise, the system enters the (II) Attention-guided Localized Visual Perturbation branch, leveraging an attention map to identify salient r…
Figure 3: Comparison of image descriptions generated by different methods on LLaVA-Bench (Left: LLaVA; Right: InstructBLIP). Hallucinated content is highlighted in red.

4.3 Ablation Study. In this section, we first evaluate the computational efficiency of CHASD by comparing the inference latency and GPU memory footprint across various methods. Subsequently, we conduct a sensitivity analysis on the key hyperparameters…
Figure 5: Sensitivity of the hyperparameter for the confidence gating threshold τ.

threshold τ, to investigate their impact on the balance between performance and overhead. All of the above experiments were conducted on the POPE benchmark (LLaVA-1.5, COCO, Adversarial).
Figure 6: MME-Fullset results for different methods (LLaVA-1.5 as backbone).
Figure 7: MME-Fullset results for different methods (InstructBLIP as backbone).
Figure 8: Comparison of image descriptions generated by different methods on LLaVA-Bench (Left: LLaVA; Right: InstructBLIP). Hallucinated content is highlighted in red.

G Prompts for GPT-4o. During the MMHal-Benchmark evaluation, we used GPT-4o [13] to assist with scoring. Here, we will present the specific prompt we used to enable it to act as a judge. Please act as an impartial and objective judge and evaluate the …
Original abstract

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Contrastive Hallucination-Aware Step-wise Decoding (CHASD), a training-free inference procedure for Large Vision-Language Models. It rests on the observation that hallucination risks are transient and token-specific, introducing an uncertainty-driven gate that activates contrastive calibration (with attention-guided localized visual perturbations for the negative branch) only when the next-token maximum probability falls below a threshold. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR report improved hallucination-related metrics relative to strong training-free baselines while preserving competitive inference efficiency.

Significance. If the selective-gating design holds, CHASD would constitute a practical refinement of contrastive decoding methods by reducing unnecessary negative-branch passes. The multi-benchmark evaluation across object-hallucination and general VLM benchmarks is a positive feature of the empirical section.

major comments (2)
  1. [Method description (confidence gate and negative-branch construction)] The central design choice—an uncertainty gate that safely omits contrastive calibration on high-confidence tokens—rests on the untested claim that hallucination risks are transient and token-specific. No analysis (e.g., hallucination rate conditioned on max-probability above the gate threshold, or missed-hallucination rate as a function of the threshold) is reported to support this assumption, which is load-bearing for both correctness and the claimed efficiency gain.
  2. [Experiments] The experimental section reports aggregate metric improvements but contains no ablation that isolates the effect of the confidence gate (i.e., CHASD versus always-on contrastive decoding with the same localized perturbations). Without this comparison it is impossible to determine whether the gate preserves the gains of full contrastive decoding or merely trades accuracy for speed.
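
The analysis requested in the first major comment could take roughly the following shape. This is a hedged sketch, not something reported in the paper: it assumes per-token records of the original model's maximum probability together with a hallucination label obtained from some annotation process, and the function and field names are hypothetical.

```python
def gate_diagnostics(records, tau):
    """Summarize what a confidence gate at threshold tau would skip.

    records: list of (max_prob, is_hallucinated) pairs, one per generated token.
             The hallucination labels are assumed to come from external annotation.
    A token is bypassed (no contrastive calibration) when max_prob > tau.
    """
    bypassed = [halluc for prob, halluc in records if prob > tau]
    total_halluc = sum(1 for _, halluc in records if halluc)
    missed = sum(1 for halluc in bypassed if halluc)
    return {
        # Share of steps that skip the negative branch (efficiency side).
        "bypass_fraction": len(bypassed) / len(records) if records else 0.0,
        # Hallucination rate conditioned on being above the threshold.
        "halluc_rate_when_bypassed": missed / len(bypassed) if bypassed else 0.0,
        # Share of all hallucinated tokens the gate never gets to correct.
        "missed_halluc_fraction": missed / total_halluc if total_halluc else 0.0,
    }


def sweep_thresholds(records, taus):
    """Evaluate the diagnostics over several candidate thresholds."""
    return {tau: gate_diagnostics(records, tau) for tau in taus}
```

Sweeping `tau` exposes the trade-off the referee asks about: a lower threshold bypasses more steps but risks a higher missed-hallucination fraction.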
minor comments (2)
  1. [Method] The threshold value and its selection procedure should be stated explicitly, together with any sensitivity analysis, rather than left as an implicit hyper-parameter.
  2. [Method] Implementation details for the attention-guided perturbation (exact masking procedure, number of tokens perturbed) are needed for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the justification of the confidence gate and the experimental design. We address each major comment below and will revise the manuscript to incorporate additional analysis and ablations as outlined.

  1. Referee: [Method description (confidence gate and negative-branch construction)] The central design choice—an uncertainty gate that safely omits contrastive calibration on high-confidence tokens—rests on the untested claim that hallucination risks are transient and token-specific. No analysis (e.g., hallucination rate conditioned on max-probability above the gate threshold, or missed-hallucination rate as a function of the threshold) is reported to support this assumption, which is load-bearing for both correctness and the claimed efficiency gain.

    Authors: We agree that the manuscript would benefit from explicit quantitative analysis to support the claim that hallucination risks are transient and token-specific. The design is motivated by observations during development that visual attention and token confidence vary across steps, but no such conditioned hallucination analysis is currently reported. In the revision we will add this analysis, including hallucination rates for tokens above the gate threshold and the effect of threshold choice on missed hallucinations versus efficiency. revision: yes

  2. Referee: [Experiments] The experimental section reports aggregate metric improvements but contains no ablation that isolates the effect of the confidence gate (i.e., CHASD versus always-on contrastive decoding with the same localized perturbations). Without this comparison it is impossible to determine whether the gate preserves the gains of full contrastive decoding or merely trades accuracy for speed.

    Authors: We acknowledge that a direct ablation isolating the confidence gate is necessary to quantify its contribution. The current results compare CHASD against other training-free baselines but do not include an always-on contrastive decoding variant using identical perturbations. We will add this ablation to the revised experimental section, reporting both hallucination metrics and inference-time measurements across the evaluated benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained empirical proposal.

The paper states an empirical observation ('hallucination risks are transient and token-specific') and directly proposes CHASD as an inference procedure (uncertainty gate + attention-guided perturbations) motivated by that observation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce the central method or claims to inputs by construction. Experiments on external benchmarks (POPE, AMBER, etc.) are presented as independent validation. This matches the default expectation of a non-circular training-free method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that hallucination risk varies meaningfully across tokens and can be detected via next-token probability, plus the unverified effectiveness of the localized perturbation strategy.

axioms (1)
  • domain assumption: Hallucination risks are transient and token-specific.
    Invoked as the key observation that justifies selective activation of the contrastive branch.

pith-pipeline@v0.9.0 · 5761 in / 1111 out tokens · 40218 ms · 2026-05-25T05:01:01.717152+00:00 · methodology

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 5 internal anchors

  [1] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  [2] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.

  [3] Z. Chen, Z. Zhao, H. Luo, H. Yao, B. Li, and J. Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. In ICML, 2024.

  [4] Y. Cho, K. Kim, T. Hwang, and S. Cho. Do you keep an eye on what i ask? mitigating multimodal hallucination via attention-guided ensemble decoding. In ICLR, 2025.

  [5] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.

  [6] A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto. Multi-modal hallucination control by visual information grounding. In CVPR, 2024.

  [7] S. Feng, W. Shi, Y. Wang, W. Ding, V. Balachandran, and Y. Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In ACL, 2024.

  [8] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, R. Ji, C. Shan, and R. He. MME: A comprehensive evaluation benchmark for multimodal large language models. In NeurIPS, 2025.

  [9] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.

  [10] Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In CVPR, 2024.

  [11] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.

  [12] F. Huo, W. Xu, Z. Zhang, H. Wang, Z. Chen, and P. Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In ICLR, 2025.

  [13] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  [14] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM computing surveys, 2023.

  [15] X. Jiang, H. Ye, Y. Zhu, X. Zheng, Z. Chen, and J. Gong. Hicd: Hallucination-inducing via attention dispersion for contrastive decoding to mitigate hallucinations in large language models. In ACL, 2025.

  [16] J. Kim, J. Kim, Y. Kim, and S.-B. Cho. Fuzzy contrastive decoding to alleviate object hallucination in large vision-language models. In ICCV, 2025.

  [17] S. Kim, B. Cho, S. Bae, S. Ahn, and S.-Y. Yun. Vacode: Visual augmented contrastive decoding. arXiv preprint arXiv:2408.05337, 2024.

  [18] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024.

  [19] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. B. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In ACL, 2023.

  [20] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.

  [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

  [22] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.

  [23] F. Ma, X. Jin, H. Wang, Y. Xian, J. Feng, and Y. Yang. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In CVPR, 2024.

  [24] Y. Park, D. Lee, J. Choe, and B. Chang. Convis: Contrastive decoding with hallucination visualization for mitigating hallucinations in multimodal large language models. In AAAI, 2025.

  [25] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In EMNLP, 2018.

  [26] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022.

  [27] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L. Gui, Y.-X. Wang, Y. Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL, 2024.

  [28] W. Suo, L. Zhang, M. Sun, L. Y. Wu, P. Wang, and Y. Zhang. Octopus: Alleviating hallucination via dynamic contrastive decoding. In CVPR, 2025.

  [29] Q. Team. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

  [30] S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024.

  [31] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, J. Wang, H. Xu, M. Yan, J. Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.

  [32] X. Wang, J. Pan, L. Ding, and C. Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In ACL, 2024.

  [33] S. Woo, D. Kim, J. Jang, Y. Choi, and C. Kim. Don’t miss the forest for the trees: Attentional vision calibration for large vision language models. In ACL, 2025.

  [34] M. Wu, J. Ji, O. Huang, J. Li, Y. Wu, X. Sun, and R. Ji. Evaluating and analyzing relationship hallucinations in lvlms. arXiv preprint arXiv:2406.16449, 4, 2024.

  [35] Y. Xia, S. Wang, and P. Li. Sdcd: Structure-disrupted contrastive decoding for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2601.03500, 2026.

  [36] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024.

  [37] Z. Yue, L. Zhang, and Q. Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. In ACL, 2024.

  [38] K. Zhang, K. Tao, Z. Luo, C. Liu, J. Tang, and H. Wang. Tars: Minmax token-adaptive preference strategy for hallucination reduction in mllms. arXiv e-prints, 2025.

  [39] M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith. How language model hallucinations can snowball. In ICML, 2024.

  [40] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In ICLR, 2024.

  [41] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.

    Appendix Overview. This appendix provides additional details and analyses to complement the main paper. It is organized as follows:
    • Section A. Social Impact. We discuss the ethical implications of the C...

  [42] POPE [20]: To mitigate the bias of simple "Yes" answers, POPE employs three progressively difficult sampling settings:
    • Random: Negative objects are randomly sampled from those not present in the image.
    • Popular: Negative objects are selected from the most frequent categories in the dataset (e.g., "person," "chair").
    • Adversarial: Negative objects are sele...

  [43] AMBER [31]: AMBER spans both Discriminative (Yes/No questions based on POPE-style sampling) and Generative (Image Captioning) tasks to provide a multi-faceted assessment.

  [44] MME [8]: A comprehensive benchmark designed to evaluate both perception and cognition capabilities.
    • The Perception track comprises 10 sub-tasks: Existence, Count, Color, Position, Celebrity, Landmark, Artwork, Poster, Movie, and Design, focusing on basic visual recognition.
    • The Cognition track consists of the remaining 4 sub-tasks: Commonsense Reasoning, N...

  [45] MMHal-Bench [27]: This benchmark evaluates hallucinatory responses across 8 complex reasoning dimensions: Attribute, Comparison, Counting, Existence, Localization, Relation, Scene, and Sport.

    D.2 Precise Metric Calculations

  [46] POPE Metrics: We report Accuracy (Acc) and F1-score (F1) to analyze the trade-off between sensitivity and specificity:

    $$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$

  [47] AMBER Score Calculation: The final AMBER score is computed as the arithmetic mean of the discriminative F1-score and the generative fidelity (represented by $100 - \mathrm{CHAIR}_i$), providing a balanced metric for both task types:

    $$\mathrm{Score}_{AMBER} = \frac{(100 - \mathrm{CHAIR}_i) + \mathrm{F1}}{2} \tag{9}$$

    where $\mathrm{CHAIR}_i$ denotes the instance-level hallucination rate in generative tasks and $\mathrm{F1}_{pope}$ is the F1-s...

  [48] MME Scoring: For each category $c$, the score is the sum of raw accuracy ($\mathrm{Acc}_c$) and balanced accuracy ($\mathrm{Acc}^{+}_c$). Let $N_c$ be the number of image-question pairs in category $c$:

    $$\mathrm{Acc}_c = \frac{\sum \text{Correct Responses}}{N_c}, \qquad \mathrm{Score}_{MME} = \sum_{c=1}^{14} \left(\mathrm{Acc}_c \times 100 + \mathrm{Acc}^{+}_c \times 100\right) \tag{10}$$

    The total maximum score for the Perception track is 2000.

  [49] MMHal-Bench Scoring: The final informative-hallucination score ($S_{total}$) is the average across all 96 samples evaluated by an LLM-based judge:

    $$S_{total} = \frac{1}{96} \sum_{j=1}^{96} \text{GPT-4-Judge}(\mathrm{Response}_j) \tag{11}$$

    E More results

    E.1 MME-Fullset. Detailed subtask evaluations on the MME benchmark are presented here. Figures 6 and 7 illustrate the performance profiles for LLaVA-1....
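
The POPE and AMBER formulas listed above (Eqs. 8 and 9) translate directly into a few lines of Python. This is a minimal sketch for checking the arithmetic, not the authors' evaluation code; the function names are hypothetical, and the AMBER helper assumes both inputs are already expressed on a 0 to 100 scale.

```python
def pope_metrics(tp, tn, fp, fn):
    """Accuracy and F1 from confusion-matrix counts (Eq. 8).

    Returns values in [0, 1]; zero denominators yield 0.0.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    denom = 2 * tp + fp + fn
    f1 = (2 * tp) / denom if denom else 0.0
    return acc, f1


def amber_score(chair_i, f1):
    """AMBER score (Eq. 9): mean of generative fidelity and discriminative F1.

    chair_i: instance-level hallucination rate, as a percentage (0 to 100).
    f1:      discriminative F1-score, as a percentage (0 to 100).
    """
    return ((100.0 - chair_i) + f1) / 2.0


acc, f1 = pope_metrics(tp=40, tn=45, fp=5, fn=10)
score = amber_score(chair_i=8.0, f1=86.0)
```

Note that `pope_metrics` returns fractions while `amber_score` expects percentages, so an F1 value from the first function would need to be multiplied by 100 before being passed to the second.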