CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs
Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3
The pith
CHASD activates contrastive decoding only on low-confidence tokens using localized visual perturbations to reduce hallucinations in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CHASD is an inference-time framework built on two mechanisms. An uncertainty-driven confidence gate activates the contrastive branch only when the maximum probability of the next token falls below a threshold, and the negative branch is built from attention-guided localized perturbations of the currently salient visual tokens. Together these reduce unnecessary negative-branch forward passes while preserving the original distribution on high-confidence steps.
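As a rough illustration of the gate, the following Python sketch shows a single decoding step. The names `gated_contrastive_step`, `tau`, and `alpha`, and the `(1 + alpha) * positive - alpha * negative` logit combination, are assumptions borrowed from common contrastive-decoding practice; the source does not state CHASD's exact formula or threshold value.

```python
import math
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_step(
    pos_logits: List[float],
    negative_branch: Callable[[], List[float]],
    tau: float = 0.9,    # hypothetical gate threshold (not given in the source)
    alpha: float = 1.0,  # hypothetical contrast strength (not given in the source)
) -> List[float]:
    """Return next-token probabilities for one decoding step."""
    probs = softmax(pos_logits)
    if max(probs) >= tau:
        # High-confidence step: keep the original distribution and
        # skip the negative-branch forward pass entirely.
        return probs
    # Low-confidence step: run the negative branch (one extra forward pass).
    neg_logits = negative_branch()
    # Assumed combination rule, common in contrastive decoding.
    contrasted = [(1 + alpha) * p - alpha * n for p, n in zip(pos_logits, neg_logits)]
    return softmax(contrasted)
```

Passing the negative branch as a callable keeps the extra forward pass lazy, so it is only paid for on steps where the gate opens.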
What carries the argument
Uncertainty-driven confidence gate that triggers attention-guided localized perturbations of salient visual tokens for the negative branch only on uncertain decoding steps.
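One way to picture the localized perturbation is a top-k mask over visual tokens ranked by attention weight. The sketch below is an assumption for illustration only: the function name, the choice of zeroing the selected tokens, and the parameter `k` are hypothetical, since the source does not specify the masking procedure or how many tokens are perturbed.

```python
from typing import List, Sequence


def perturb_salient_tokens(
    visual_tokens: Sequence[Sequence[float]],
    attention: Sequence[float],
    k: int = 2,  # hypothetical number of tokens to perturb
) -> List[List[float]]:
    """Return a copy of visual_tokens with the k most-attended tokens zeroed."""
    if len(visual_tokens) != len(attention):
        raise ValueError("visual_tokens and attention must have the same length")
    k = max(0, min(k, len(attention)))
    # Indices of the k tokens with the highest attention weight.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    salient = set(ranked[:k])
    # Only salient tokens are altered; all other visual evidence is left intact.
    return [
        [0.0] * len(tok) if i in salient else list(tok)
        for i, tok in enumerate(visual_tokens)
    ]
```

Because the input is copied rather than modified in place, the positive branch can keep using the unperturbed tokens.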
If this is right
- Hallucination-related metrics improve over strong training-free baselines on POPE, AMBER, MME, MMHal-Bench, and CHAIR.
- Inference efficiency stays competitive because the negative branch runs only on low-confidence steps.
- High-confidence steps keep the original model distribution without perturbation.
- Localized perturbations avoid altering useful visual evidence that global methods might change.
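The efficiency claim in the list above can be made concrete with a back-of-the-envelope count of forward passes. This sketch assumes one positive pass per step plus one negative pass per gated step, and ignores any overhead from extracting attention maps; the function name and the threshold `tau` are hypothetical.

```python
from typing import Dict, Sequence


def forward_pass_counts(max_probs: Sequence[float], tau: float) -> Dict[str, int]:
    """Compare forward passes for gated vs. always-on contrastive decoding."""
    total_steps = len(max_probs)
    # The gate opens only when the top next-token probability is below tau.
    gated_steps = sum(1 for p in max_probs if p < tau)
    return {
        "gated_steps": gated_steps,
        "gated_passes": total_steps + gated_steps,  # positive + on-demand negative
        "always_on_passes": 2 * total_steps,        # positive + negative every step
    }
```

For example, with per-step maximum probabilities `[0.99, 0.4, 0.95, 0.6]` and `tau = 0.9`, the gate opens on two of four steps, giving six passes instead of eight.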
Where Pith is reading between the lines
- The same gating idea could be tested on other error types such as factual inconsistencies in text-only models.
- Varying the confidence threshold per model size might yield further efficiency gains.
- If attention shifts reliably mark hallucination-prone steps, the method might extend to video or audio-language models.
Load-bearing premise
Hallucination risks are transient and limited to specific low-confidence tokens, so skipping contrastive calibration on high-confidence steps does not miss critical hallucinations.
What would settle it
A benchmark run in which CHASD shows no improvement or a drop in hallucination metrics on POPE or AMBER compared to always-on contrastive baselines, indicating that high-confidence steps still produce uncorrected hallucinations.
Original abstract
Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Contrastive Hallucination-Aware Step-wise Decoding (CHASD), a training-free inference procedure for Large Vision-Language Models. It rests on the observation that hallucination risks are transient and token-specific, introducing an uncertainty-driven gate that activates contrastive calibration (with attention-guided localized visual perturbations for the negative branch) only when the next-token maximum probability falls below a threshold. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR report improved hallucination-related metrics relative to strong training-free baselines while preserving competitive inference efficiency.
Significance. If the selective-gating design holds, CHASD would constitute a practical refinement of contrastive decoding methods by reducing unnecessary negative-branch passes. The multi-benchmark evaluation across object-hallucination and general VLM benchmarks is a positive feature of the empirical section.
major comments (2)
- [Method description (confidence gate and negative-branch construction)] The central design choice—an uncertainty gate that safely omits contrastive calibration on high-confidence tokens—rests on the untested claim that hallucination risks are transient and token-specific. No analysis (e.g., hallucination rate conditioned on max-probability above the gate threshold, or missed-hallucination rate as a function of the threshold) is reported to support this assumption, which is load-bearing for both correctness and the claimed efficiency gain.
- [Experiments] The experimental section reports aggregate metric improvements but contains no ablation that isolates the effect of the confidence gate (i.e., CHASD versus always-on contrastive decoding with the same localized perturbations). Without this comparison it is impossible to determine whether the gate preserves the gains of full contrastive decoding or merely trades accuracy for speed.
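The conditioned analysis requested in the first major comment could be sketched as follows. The input format, a list of `(max_prob, is_hallucinated)` pairs with token-level hallucination labels, is an assumption; the source does not describe such annotations, and the function name is hypothetical.

```python
from typing import Dict, Sequence, Tuple


def gate_miss_report(
    records: Sequence[Tuple[float, bool]],  # (max next-token prob, hallucinated?)
    tau: float,
) -> Dict[str, float]:
    """Summarize what a confidence gate at threshold tau would skip."""
    # Tokens at or above tau bypass contrastive calibration.
    skipped = [is_h for p, is_h in records if p >= tau]
    missed = sum(1 for is_h in skipped if is_h)
    total_halluc = sum(1 for _, is_h in records if is_h)
    return {
        # Share of steps that avoid the negative branch (efficiency side).
        "skipped_fraction": len(skipped) / len(records) if records else 0.0,
        # Hallucination rate conditioned on max-prob >= tau.
        "halluc_rate_above_tau": missed / len(skipped) if skipped else 0.0,
        # Share of all hallucinated tokens the gate never calibrates.
        "missed_halluc_fraction": missed / total_halluc if total_halluc else 0.0,
    }
```

Sweeping `tau` over a grid and plotting `missed_halluc_fraction` against `skipped_fraction` would give the accuracy-versus-efficiency curve the referee asks for.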
minor comments (2)
- [Method] The threshold value and its selection procedure should be stated explicitly, together with any sensitivity analysis, rather than left as an implicit hyper-parameter.
- [Method] Implementation details for the attention-guided perturbation (exact masking procedure, number of tokens perturbed) are needed for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the justification of the confidence gate and the experimental design. We address each major comment below and will revise the manuscript to incorporate additional analysis and ablations as outlined.
-
Referee: [Method description (confidence gate and negative-branch construction)] The central design choice—an uncertainty gate that safely omits contrastive calibration on high-confidence tokens—rests on the untested claim that hallucination risks are transient and token-specific. No analysis (e.g., hallucination rate conditioned on max-probability above the gate threshold, or missed-hallucination rate as a function of the threshold) is reported to support this assumption, which is load-bearing for both correctness and the claimed efficiency gain.
Authors: We agree that the manuscript would benefit from explicit quantitative analysis to support the claim that hallucination risks are transient and token-specific. The design is motivated by observations during development that visual attention and token confidence vary across steps, but no such conditioned hallucination analysis is currently reported. In the revision we will add this analysis, including hallucination rates for tokens above the gate threshold and the effect of threshold choice on missed hallucinations versus efficiency.
Revision: yes
-
Referee: [Experiments] The experimental section reports aggregate metric improvements but contains no ablation that isolates the effect of the confidence gate (i.e., CHASD versus always-on contrastive decoding with the same localized perturbations). Without this comparison it is impossible to determine whether the gate preserves the gains of full contrastive decoding or merely trades accuracy for speed.
Authors: We acknowledge that a direct ablation isolating the confidence gate is necessary to quantify its contribution. The current results compare CHASD against other training-free baselines but do not include an always-on contrastive decoding variant using identical perturbations. We will add this ablation to the revised experimental section, reporting both hallucination metrics and inference-time measurements across the evaluated benchmarks.
Revision: yes
Circularity Check
No significant circularity; the derivation is a self-contained empirical proposal.
The paper states an empirical observation ('hallucination risks are transient and token-specific') and directly proposes CHASD as an inference procedure (uncertainty gate + attention-guided perturbations) motivated by that observation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce the central method or claims to inputs by construction. Experiments on external benchmarks (POPE, AMBER, etc.) are presented as independent validation. This matches the default expectation of a non-circular training-free method.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Hallucination risks are transient and token-specific.
Reference graph
Works this paper leans on
- [1] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- [3] Z. Chen, Z. Zhao, H. Luo, H. Yao, B. Li, and J. Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. In ICML, 2024.
- [4] Y. Cho, K. Kim, T. Hwang, and S. Cho. Do you keep an eye on what i ask? mitigating multimodal hallucination via attention-guided ensemble decoding. In ICLR, 2025.
- [5] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
- [6]
- [7] S. Feng, W. Shi, Y. Wang, W. Ding, V. Balachandran, and Y. Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In ACL, 2024.
- [8] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, R. Ji, C. Shan, and R. He. MME: A comprehensive evaluation benchmark for multimodal large language models. In NeurIPS, 2025.
- [9] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.
- [10]
- [11] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- [12] F. Huo, W. Xu, Z. Zhang, H. Wang, Z. Chen, and P. Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In ICLR, 2025.
- [13] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [14] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM computing surveys, 2023.
- [15]
- [16] J. Kim, J. Kim, Y. Kim, and S.-B. Cho. Fuzzy contrastive decoding to alleviate object hallucination in large vision-language models. In ICCV, 2025.
- [17]
- [18] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024.
- [19] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. B. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In ACL, 2023.
- [20] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
- [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
- [22] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.
- [23] F. Ma, X. Jin, H. Wang, Y. Xian, J. Feng, and Y. Yang. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In CVPR, 2024.
- [24] Y. Park, D. Lee, J. Choe, and B. Chang. Convis: Contrastive decoding with hallucination visualization for mitigating hallucinations in multimodal large language models. In AAAI, 2025.
- [25] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In EMNLP, 2018.
- [26] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
- [27] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L. Gui, Y.-X. Wang, Y. Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL, 2024.
- [28] W. Suo, L. Zhang, M. Sun, L. Y. Wu, P. Wang, and Y. Zhang. Octopus: Alleviating hallucination via dynamic contrastive decoding. In CVPR, 2025.
- [29] Q. Team. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [30] S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024.
- [31] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, J. Wang, H. Xu, M. Yan, J. Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
- [32] X. Wang, J. Pan, L. Ding, and C. Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In ACL, 2024.
- [33] S. Woo, D. Kim, J. Jang, Y. Choi, and C. Kim. Don’t miss the forest for the trees: Attentional vision calibration for large vision language models. In ACL, 2025.
- [34]
- [35]
- [36] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024.
- [37] Z. Yue, L. Zhang, and Q. Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. In ACL, 2024.
- [38]
- [39]
- [40] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In ICLR, 2024.
- [41] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.
Appendix Overview
This appendix provides additional details and analyses to complement the main paper. It is organized as follows:
• Section A. Social Impact. We discuss the ethical implications of the C...
- POPE [20]: To mitigate the bias of simple "Yes" answers, POPE employs three progressively difficult sampling settings:
  • Random: Negative objects are randomly sampled from those not present in the image.
  • Popular: Negative objects are selected from the most frequent categories in the dataset (e.g., "person," "chair").
  • Adversarial: Negative objects are sele...
- AMBER [31]: AMBER spans both Discriminative (Yes/No questions based on POPE-style sampling) and Generative (Image Captioning) tasks to provide a multi-faceted assessment.
- MME [8]: A comprehensive benchmark designed to evaluate both perception and cognition capabilities.
  • The Perception track comprises 10 sub-tasks: Existence, Count, Color, Position, Celebrity, Landmark, Artwork, Poster, Movie, and Design, focusing on basic visual recognition.
  • The Cognition track consists of the remaining 4 sub-tasks: Commonsense Reasoning, N...
- MMHal-Bench [27]: This benchmark evaluates hallucinatory responses across 8 complex reasoning dimensions: Attribute, Comparison, Counting, Existence, Localization, Relation, Scene, and Sport.
D.2 Precise Metric Calculations
- POPE Metrics: We report Accuracy (Acc) and F1-score (F1) to analyze the trade-off between sensitivity and specificity:
  $$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$
- AMBER Score Calculation: The final AMBER score is computed as the arithmetic mean of the discriminative F1-score and the generative fidelity (represented by $100 - \mathrm{CHAIR}_i$), providing a balanced metric for both task types:
  $$\mathrm{Score}_{\mathrm{AMBER}} = \frac{(100 - \mathrm{CHAIR}_i) + \mathrm{F1}}{2} \tag{9}$$
  where $\mathrm{CHAIR}_i$ denotes the instance-level hallucination rate in generative tasks and $\mathrm{F1}_{\mathrm{pope}}$ is the F1-s...
- MME Scoring: For each category $c$, the score is the sum of raw accuracy ($\mathrm{Acc}_c$) and balanced accuracy ($\mathrm{Acc}^{+}_c$). Let $N_c$ be the number of image-question pairs in category $c$:
  $$\mathrm{Acc}_c = \frac{\sum \text{Correct Responses}}{N_c}, \qquad \mathrm{Score}_{\mathrm{MME}} = \sum_{c=1}^{14} \left(\mathrm{Acc}_c \times 100 + \mathrm{Acc}^{+}_c \times 100\right) \tag{10}$$
  The total maximum score for the Perception track is 2000.
- MMHal-Bench Scoring: The final informative-hallucination score ($S_{\mathrm{total}}$) is the average across all 96 samples evaluated by an LLM-based judge:
  $$S_{\mathrm{total}} = \frac{1}{96} \sum_{j=1}^{96} \text{GPT-4-Judge}(\mathrm{Response}_j) \tag{11}$$
E More results
E.1 MME-Fullset
Detailed subtask evaluations on the MME benchmark are presented here. Figures 6 and 7 illustrate the performance profiles for LLaVA-1....
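The metric definitions in Section D.2 (Eqs. 8 to 11) translate directly into a few lines of Python. This is a minimal sketch: the function names and input conventions (percent-scale `chair_i` and `f1` for AMBER, fractional accuracies for MME) are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Sequence, Tuple


def pope_metrics(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Accuracy and F1 from confusion-matrix counts (Eq. 8)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def amber_score(chair_i: float, f1: float) -> float:
    """Mean of generative fidelity (100 - CHAIR_i) and F1, both in percent (Eq. 9)."""
    return ((100.0 - chair_i) + f1) / 2.0


def mme_score(per_category: Sequence[Tuple[float, float]]) -> float:
    """Sum of Acc_c * 100 + Acc+_c * 100 over categories (Eq. 10).

    Each entry is (acc, acc_plus), both as fractions in [0, 1].
    """
    return sum(acc * 100 + acc_plus * 100 for acc, acc_plus in per_category)


def mmhal_score(judge_scores: Sequence[float]) -> float:
    """Average judge score over all evaluated samples (Eq. 11)."""
    if not judge_scores:
        raise ValueError("judge_scores must not be empty")
    return sum(judge_scores) / len(judge_scores)
```

As a sanity check, ten perception categories with perfect `(1.0, 1.0)` scores give `mme_score` of 2000, matching the stated maximum for the Perception track.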