CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

Kejia Zhang; Xiaoyi Huang; Zhiming Luo

arxiv: 2605.23344 · v1 · pith:FN5EBS45new · submitted 2026-05-22 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-25 05:01 UTC · model grok-4.3


keywords: contrastive decoding · hallucination mitigation · large vision-language models · inference-time calibration · object hallucination · uncertainty gating · localized perturbation · attention-guided decoding


The pith

CHASD activates contrastive decoding only on low-confidence tokens using localized visual perturbations to reduce hallucinations in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that hallucination risk in large vision-language models is transient and token-specific: it concentrates at decoding steps where visual evidence is weak, so a confidence gate can safely limit contrastive calibration to those steps. It constructs the negative branch by perturbing only the visual tokens currently attended to, rather than perturbing the whole image or running the branch at every token. This selective approach is meant to cut object hallucinations while avoiding extra computation on high-confidence steps, which keep the original distribution. A reader would care if the method delivers better benchmark scores than prior training-free contrastive techniques without a large inference slowdown.

Core claim

CHASD is an inference-time framework that uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next token is below a threshold, and builds the negative branch through attention-guided localized perturbations of the currently salient visual tokens, thereby reducing unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps.
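The gating rule in the core claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the default values of `tau` and `alpha`, and the VCD-style logit contrast `(1 + alpha) * logits - alpha * negative_logits` are all assumptions, since the text above only states that the contrastive branch runs when the maximum next-token probability is below a threshold.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_step(
    logits: Sequence[float],
    negative_logits_fn: Callable[[], Sequence[float]],
    tau: float = 0.9,    # assumed threshold value, not taken from the paper
    alpha: float = 1.0,  # assumed contrast strength, not taken from the paper
) -> Tuple[List[float], bool]:
    """Return (next-token distribution, whether the negative branch ran)."""
    probs = softmax(logits)
    if max(probs) >= tau:
        # High confidence: keep the original distribution and skip the
        # extra forward pass entirely.
        return probs, False
    # Low confidence: run the negative branch once and contrast the logits.
    # The contrast formula below is an assumed VCD-style rule.
    negative = negative_logits_fn()
    contrasted = [(1 + alpha) * p - alpha * n for p, n in zip(logits, negative)]
    return softmax(contrasted), True
```

Passing the negative branch as a callable keeps the cost model explicit: the second forward pass is only paid for when the gate opens.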

What carries the argument

Uncertainty-driven confidence gate that triggers attention-guided localized perturbations of salient visual tokens for the negative branch only on uncertain decoding steps.
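One way to picture the localized negative branch is the sketch below. It is hypothetical: the page does not give the exact perturbation procedure, so the top-k selection, the Gaussian noise, and the names `perturb_salient_tokens`, `top_k`, and `noise_std` are illustrative assumptions only.

```python
import random
from typing import List, Sequence


def perturb_salient_tokens(
    visual_tokens: Sequence[Sequence[float]],
    attention: Sequence[float],
    top_k: int = 2,          # assumed number of tokens to perturb
    noise_std: float = 1.0,  # assumed perturbation strength
    seed: int = 0,
) -> List[List[float]]:
    """Add Gaussian noise to the top_k most-attended visual tokens only."""
    if len(visual_tokens) != len(attention):
        raise ValueError("one attention weight per visual token is required")
    rng = random.Random(seed)
    # Rank visual tokens by how strongly the current decoding step attends to them.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    salient = set(ranked[:top_k])
    perturbed = []
    for i, token in enumerate(visual_tokens):
        if i in salient:
            perturbed.append([x + rng.gauss(0.0, noise_std) for x in token])
        else:
            # Tokens outside the salient set are copied unchanged, so the
            # rest of the visual evidence stays intact.
            perturbed.append(list(token))
    return perturbed
```

The contrast with a global perturbation is that only the salient indices change; every other visual token reaches the negative branch exactly as it reached the original one.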

If this is right

Hallucination-related metrics improve over strong training-free baselines on POPE, AMBER, MME, MMHal-Bench, and CHAIR.
Inference efficiency stays competitive because the negative branch runs only on low-confidence steps.
High-confidence steps keep the original model distribution without perturbation.
Localized perturbations avoid altering useful visual evidence that global methods might change.
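The efficiency point can be made concrete with a rough pass count. The sketch assumes each branch costs one forward pass per decoding step and ignores the cost of reading the attention map and building the perturbation; the function name and the example confidences are hypothetical.

```python
from typing import Dict, Sequence


def forward_pass_counts(max_probs: Sequence[float], tau: float) -> Dict[str, int]:
    """Compare forward passes for gated versus always-on contrastive decoding."""
    steps = len(max_probs)
    # The negative branch runs only on steps whose max probability is below tau.
    gated_negative = sum(1 for p in max_probs if p < tau)
    return {
        "always_on": 2 * steps,
        "gated": steps + gated_negative,
        "negative_passes_skipped": steps - gated_negative,
    }


# Hypothetical per-step confidences for a five-token generation.
counts = forward_pass_counts([0.99, 0.95, 0.40, 0.97, 0.60], tau=0.9)
```

Under these assumptions the saving is exactly the number of steps at or above the threshold, which is why the choice of threshold controls both cost and coverage.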

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gating idea could be tested on other error types such as factual inconsistencies in text-only models.
Varying the confidence threshold per model size might yield further efficiency gains.
If attention shifts reliably mark hallucination-prone steps, the method might extend to video or audio-language models.

Load-bearing premise

Hallucination risks are transient and limited to specific low-confidence tokens, so skipping contrastive calibration on high-confidence steps does not miss critical hallucinations.

What would settle it

A benchmark run in which CHASD shows no improvement or a drop in hallucination metrics on POPE or AMBER compared to always-on contrastive baselines, indicating that high-confidence steps still produce uncorrected hallucinations.

Figures

Figures reproduced from arXiv: 2605.23344 by Kejia Zhang, Xiaoyi Huang, Zhiming Luo.

**Figure 1.** Visualization of step-wise probabilities and cross-attention maps during the LVLM generation process. Top: The model assigns varying confidence scores to each token. The high confidence observed in functional tokens suggests that the contrastive decoding branch can be selectively bypassed to enhance inference efficiency. Bottom: The model’s visual focus areas shift dynamically when generating different tok…

**Figure 2.** Overview of CHASD at time step t. For the initial token generated at the current step, it first undergoes (I) Uncertainty-driven Confidence Gating: if the maximum predictive probability P exceeds the threshold τ, the system directly outputs the candidate token. Otherwise, the system enters the (II) Attention-guided Localized Visual Perturbation branch, leveraging an attention map to identify salient r…

**Figure 3.** Comparison of image descriptions generated by different methods on LLaVA-Bench (Left: LLaVA; Right: InstructBLIP). Hallucinated content is highlighted in red.

**Figure 5.** Sensitivity of the hyperparameter for the confidence gating threshold τ. All of the above experiments were conducted on the POPE benchmark (LLaVA-1.5, COCO, Adversarial).

**Figure 6.** MME-Fullset results for different methods (LLaVA-1.5 as backbone).

**Figure 7.** MME-Fullset results for different methods (InstructBLIP as backbone).

**Figure 8.** Comparison of image descriptions generated by different methods on LLaVA-Bench (Left: LLaVA; Right: InstructBLIP). Hallucinated content is highlighted in red.

Original abstract

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CHASD adds a confidence gate and localized perturbations to contrastive decoding, but the claim that high-confidence tokens can safely skip calibration rests on an untested assumption.


The main thing here is a gated contrastive decoding setup that only runs the negative branch when the next-token max probability falls below some threshold, with the negative view created by attention-guided local changes to salient visual tokens rather than global image tweaks. This is meant to cut extra forward passes while still catching hallucinations when the model is uncertain. The localized perturbation is a sensible step beyond prior global methods because it tries to leave useful visual evidence untouched. The gating idea follows directly from the observation that attention shifts across tokens and that some tokens are produced confidently. Both choices are straightforward engineering moves that could matter for deployment where inference cost matters.

The experiments claim better hallucination metrics than strong training-free baselines on POPE, AMBER, MME, MMHal-Bench, and CHAIR while keeping competitive speed, which is the kind of result that would interest people already running contrastive decoding.

The soft spot is the load-bearing assumption that hallucination risks are transient and token-specific enough that skipping the contrastive step on high-confidence tokens is safe. The abstract states this as an observation but supplies no conditional analysis of hallucination rates on high- versus low-confidence steps, no ablation on the threshold choice, and no check on hallucinations missed when the gate skips the contrastive branch. If even a modest share of hallucinations occur on confident tokens, the efficiency gain either comes with reduced reliability or forces a higher threshold that erodes the claimed savings. Without those checks visible, the reported improvements are hard to interpret.

This is a narrow but coherent extension inside the hallucination-mitigation literature. Readers already working on inference-time fixes for vision-language models would get the most from the design details. It is not a big conceptual advance, but the work is coherent enough on its own terms to deserve referee time so the experiments and the gate validation can be examined directly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Contrastive Hallucination-Aware Step-wise Decoding (CHASD), a training-free inference procedure for Large Vision-Language Models. It rests on the observation that hallucination risks are transient and token-specific, introducing an uncertainty-driven gate that activates contrastive calibration (with attention-guided localized visual perturbations for the negative branch) only when the next-token maximum probability falls below a threshold. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR report improved hallucination-related metrics relative to strong training-free baselines while preserving competitive inference efficiency.

Significance. If the selective-gating design holds, CHASD would constitute a practical refinement of contrastive decoding methods by reducing unnecessary negative-branch passes. The multi-benchmark evaluation across object-hallucination and general VLM benchmarks is a positive feature of the empirical section.

major comments (2)

[Method description (confidence gate and negative-branch construction)] The central design choice—an uncertainty gate that safely omits contrastive calibration on high-confidence tokens—rests on the untested claim that hallucination risks are transient and token-specific. No analysis (e.g., hallucination rate conditioned on max-probability above the gate threshold, or missed-hallucination rate as a function of the threshold) is reported to support this assumption, which is load-bearing for both correctness and the claimed efficiency gain.
[Experiments] The experimental section reports aggregate metric improvements but contains no ablation that isolates the effect of the confidence gate (i.e., CHASD versus always-on contrastive decoding with the same localized perturbations). Without this comparison it is impossible to determine whether the gate preserves the gains of full contrastive decoding or merely trades accuracy for speed.
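The conditioned analysis requested in the first major comment could be computed roughly as follows. This is a sketch under stated assumptions: it presumes per-token hallucination labels are available (the manuscript reports none), and the function and field names are hypothetical.

```python
from typing import Dict, List, Sequence


def gate_diagnostics(
    max_probs: Sequence[float],
    is_hallucinated: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Dict[str, float]]:
    """For each threshold, summarize what the gate would skip and miss."""
    if not max_probs or len(max_probs) != len(is_hallucinated):
        raise ValueError("need equal-length, non-empty inputs")
    total_hallucinated = sum(is_hallucinated)
    rows = []
    for tau in thresholds:
        # Labels of the tokens the gate would leave uncalibrated (p >= tau).
        skipped = [h for p, h in zip(max_probs, is_hallucinated) if p >= tau]
        missed = sum(skipped)
        rows.append({
            "tau": tau,
            "skip_fraction": len(skipped) / len(max_probs),
            "halluc_rate_when_skipped": missed / len(skipped) if skipped else 0.0,
            "missed_halluc_fraction": (
                missed / total_hallucinated if total_hallucinated else 0.0
            ),
        })
    return rows
```

Sweeping `thresholds` gives the trade-off the referee asks about: `skip_fraction` tracks the efficiency gain, while `missed_halluc_fraction` tracks the hallucinations that never reach the contrastive branch.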

minor comments (2)

[Method] The threshold value and its selection procedure should be stated explicitly, together with any sensitivity analysis, rather than left as an implicit hyper-parameter.
[Method] Implementation details for the attention-guided perturbation (exact masking procedure, number of tokens perturbed) are needed for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the justification of the confidence gate and the experimental design. We address each major comment below and will revise the manuscript to incorporate additional analysis and ablations as outlined.


Referee: [Method description (confidence gate and negative-branch construction)] The central design choice—an uncertainty gate that safely omits contrastive calibration on high-confidence tokens—rests on the untested claim that hallucination risks are transient and token-specific. No analysis (e.g., hallucination rate conditioned on max-probability above the gate threshold, or missed-hallucination rate as a function of the threshold) is reported to support this assumption, which is load-bearing for both correctness and the claimed efficiency gain.

Authors: We agree that the manuscript would benefit from explicit quantitative analysis to support the claim that hallucination risks are transient and token-specific. The design is motivated by observations during development that visual attention and token confidence vary across steps, but no such conditioned hallucination analysis is currently reported. In the revision we will add this analysis, including hallucination rates for tokens above the gate threshold and the effect of threshold choice on missed hallucinations versus efficiency. revision: yes
Referee: [Experiments] The experimental section reports aggregate metric improvements but contains no ablation that isolates the effect of the confidence gate (i.e., CHASD versus always-on contrastive decoding with the same localized perturbations). Without this comparison it is impossible to determine whether the gate preserves the gains of full contrastive decoding or merely trades accuracy for speed.

Authors: We acknowledge that a direct ablation isolating the confidence gate is necessary to quantify its contribution. The current results compare CHASD against other training-free baselines but do not include an always-on contrastive decoding variant using identical perturbations. We will add this ablation to the revised experimental section, reporting both hallucination metrics and inference-time measurements across the evaluated benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical proposal


The paper states an empirical observation ('hallucination risks are transient and token-specific') and directly proposes CHASD as an inference procedure (uncertainty gate + attention-guided perturbations) motivated by that observation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce the central method or claims to inputs by construction. Experiments on external benchmarks (POPE, AMBER, etc.) are presented as independent validation. This matches the default expectation of a non-circular training-free method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that hallucination risk varies meaningfully across tokens and can be detected via next-token probability, plus the unverified effectiveness of the localized perturbation strategy.

axioms (1)

domain assumption: Hallucination risks are transient and token-specific
Invoked as the key observation that justifies selective activation of the contrastive branch.

pith-pipeline@v0.9.0 · 5761 in / 1111 out tokens · 40218 ms · 2026-05-25T05:01:01.717152+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 5 internal anchors

[1] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[2] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[3] Z. Chen, Z. Zhao, H. Luo, H. Yao, B. Li, and J. Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. In ICML, 2024.
[4] Y. Cho, K. Kim, T. Hwang, and S. Cho. Do you keep an eye on what i ask? mitigating multimodal hallucination via attention-guided ensemble decoding. In ICLR, 2025.
[5] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
[6] A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto. Multi-modal hallucination control by visual information grounding. In CVPR, 2024.
[7] S. Feng, W. Shi, Y. Wang, W. Ding, V. Balachandran, and Y. Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In ACL, 2024.
[8] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, R. Ji, C. Shan, and R. He. MME: A comprehensive evaluation benchmark for multimodal large language models. In NeurIPS, 2025.
[9] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.
[10] Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In CVPR, 2024.
[11] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[12] F. Huo, W. Xu, Z. Zhang, H. Wang, Z. Chen, and P. Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models. In ICLR, 2025.
[13] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[14] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM computing surveys, 2023.
[15] X. Jiang, H. Ye, Y. Zhu, X. Zheng, Z. Chen, and J. Gong. Hicd: Hallucination-inducing via attention dispersion for contrastive decoding to mitigate hallucinations in large language models. In ACL, 2025.
[16] J. Kim, J. Kim, Y. Kim, and S.-B. Cho. Fuzzy contrastive decoding to alleviate object hallucination in large vision-language models. In ICCV, 2025.
[17] S. Kim, B. Cho, S. Bae, S. Ahn, and S.-Y. Yun. Vacode: Visual augmented contrastive decoding. arXiv preprint arXiv:2408.05337, 2024.
[18] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In CVPR, 2024.
[19] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. B. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In ACL, 2023.
[20] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[22] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.
[23] F. Ma, X. Jin, H. Wang, Y. Xian, J. Feng, and Y. Yang. Vista-llama: Reducing hallucination in video language models via equal distance to visual tokens. In CVPR, 2024.
[24] Y. Park, D. Lee, J. Choe, and B. Chang. Convis: Contrastive decoding with hallucination visualization for mitigating hallucinations in multimodal large language models. In AAAI, 2025.
[25] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In EMNLP, 2018.
[26] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
[27] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L. Gui, Y.-X. Wang, Y. Yang, et al. Aligning large multimodal models with factually augmented rlhf. In ACL, 2024.
[28] W. Suo, L. Zhang, M. Sun, L. Y. Wu, P. Wang, and Y. Zhang. Octopus: Alleviating hallucination via dynamic contrastive decoding. In CVPR, 2025.
[29] Q. Team. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[30] S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024.
[31] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, J. Wang, H. Xu, M. Yan, J. Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
[32] X. Wang, J. Pan, L. Ding, and C. Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In ACL, 2024.
[33] S. Woo, D. Kim, J. Jang, Y. Choi, and C. Kim. Don’t miss the forest for the trees: Attentional vision calibration for large vision language models. In ACL, 2025.
[34] M. Wu, J. Ji, O. Huang, J. Li, Y. Wu, X. Sun, and R. Ji. Evaluating and analyzing relationship hallucinations in lvlms. arXiv preprint arXiv:2406.16449, 4, 2024.
[35] Y. Xia, S. Wang, and P. Li. Sdcd: Structure-disrupted contrastive decoding for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2601.03500, 2026.
[36] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024.
[37] Z. Yue, L. Zhang, and Q. Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective. In ACL, 2024.
[38] K. Zhang, K. Tao, Z. Luo, C. Liu, J. Tang, and H. Wang. Tars: Minmax token-adaptive preference strategy for hallucination reduction in mllms. arXiv e-prints, 2025.
[39] M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith. How language model hallucinations can snowball. In ICML, 2024.
[40] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In ICLR, 2024.
[41] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.

Appendix Overview. This appendix provides additional details and analyses to complement the main paper. It is organized as follows: • Section A. Social Impact. We discuss the ethical implications of the C...

[42] POPE [20]: To mitigate the bias of simple "Yes" answers, POPE employs three progressively difficult sampling settings:
• Random: Negative objects are randomly sampled from those not present in the image.
• Popular: Negative objects are selected from the most frequent categories in the dataset (e.g., "person," "chair").
• Adversarial: Negative objects are sele...

[43] AMBER [31]: AMBER spans both Discriminative (Yes/No questions based on POPE-style sampling) and Generative (Image Captioning) tasks to provide a multi-faceted assessment.

[44] MME [8]: A comprehensive benchmark designed to evaluate both perception and cognition capabilities.
• The Perception track comprises 10 sub-tasks: Existence, Count, Color, Position, Celebrity, Landmark, Artwork, Poster, Movie, and Design, focusing on basic visual recognition.
• The Cognition track consists of the remaining 4 sub-tasks: Commonsense Reasoning, N...

[45] MMHal-Bench [27]: This benchmark evaluates hallucinatory responses across 8 complex reasoning dimensions: Attribute, Comparison, Counting, Existence, Localization, Relation, Scene, and Sport.

D.2 Precise Metric Calculations

[46] POPE Metrics: We report Accuracy ($\mathrm{Acc}$) and F1-score ($\mathrm{F1}$) to analyze the trade-off between sensitivity and specificity:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$

[47] AMBER Score Calculation: The final AMBER score is computed as the arithmetic mean of the discriminative F1-score and the generative fidelity (represented by $100 - \mathrm{CHAIR}_i$), providing a balanced metric for both task types:

$$\mathrm{Score}_{\mathrm{AMBER}} = \frac{(100 - \mathrm{CHAIR}_i) + \mathrm{F1}}{2} \tag{9}$$

where $\mathrm{CHAIR}_i$ denotes the instance-level hallucination rate in generative tasks and $\mathrm{F1}_{\mathrm{pope}}$ is the F1-s...

[48] MME Scoring: For each category $c$, the score is the sum of raw accuracy ($\mathrm{Acc}_c$) and balanced accuracy ($\mathrm{Acc}^{+}_c$). Let $N_c$ be the number of image-question pairs in category $c$:

$$\mathrm{Acc}_c = \frac{\sum \text{Correct Responses}}{N_c}, \qquad \mathrm{Score}_{\mathrm{MME}} = \sum_{c=1}^{14} \left(\mathrm{Acc}_c \times 100 + \mathrm{Acc}^{+}_c \times 100\right) \tag{10}$$

The total maximum score for the Perception track is 2000.

[49] MMHal-Bench Scoring: The final informative-hallucination score ($S_{\mathrm{total}}$) is the average across all 96 samples evaluated by an LLM-based judge:

$$S_{\mathrm{total}} = \frac{1}{96} \sum_{j=1}^{96} \text{GPT-4-Judge}(\mathrm{Response}_j) \tag{11}$$

E More results. E.1 MME-Fullset. Detailed subtask evaluations on the MME benchmark are presented here. Figures 6 and 7 illustrate the performance profiles for LLaVA-1....
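The POPE and AMBER metric definitions quoted in entries [46] and [47] above translate directly into code. The sketch below is illustrative: the function names are hypothetical, and `amber_score` assumes F1 is already expressed on a 0 to 100 scale, which the formula implies by adding it to 100 minus CHAIR but which the excerpt does not state outright.

```python
from typing import Dict


def pope_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy and F1 from confusion-matrix counts, as in Eq. (8)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"acc": acc, "f1": f1}


def amber_score(chair_i: float, f1_percent: float) -> float:
    """Mean of generative fidelity (100 - CHAIR_i) and F1, as in Eq. (9).

    Both inputs are assumed to be percentages in [0, 100].
    """
    return ((100.0 - chair_i) + f1_percent) / 2.0


# Hypothetical counts, chosen only to exercise the formulas.
metrics = pope_metrics(tp=40, tn=45, fp=5, fn=10)
score = amber_score(chair_i=8.0, f1_percent=86.0)
```

Guarding the zero-denominator cases keeps the helpers usable on empty or degenerate splits, where the formulas themselves are undefined.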

[1] [1]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Z. Chen, Z. Zhao, H. Luo, H. Yao, B. Li, and J. Zhou. Halc: Object hallucination reduction via adaptive focal-contrast decoding. InICML, 2024

work page 2024

[4] [4]

Y . Cho, K. Kim, T. Hwang, and S. Cho. Do you keep an eye on what i ask? mitigating multimodal hallucination via attention-guided ensemble decoding. InICLR, 2025

work page 2025

[5] [5]

W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. InNeurIPS, 2023

work page 2023

[6] [6]

Favero, L

A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto. Multi-modal hallucination control by visual information grounding. InCVPR, 2024

work page 2024

[7] [7]

S. Feng, W. Shi, Y . Wang, W. Ding, V . Balachandran, and Y . Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. InACL, 2024

work page 2024

[8] [8]

C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y . Wu, R. Ji, C. Shan, and R. He. MME: A comprehensive evaluation benchmark for multimodal large language models. InNeurIPS, 2025

work page 2025

[9] [9]

T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y . Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InCVPR, 2024

work page 2024

[10] [10]

Huang, X

Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. InCVPR, 2024

work page 2024

[11] [11]

D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InCVPR, 2019

work page 2019

[12] [12]

F. Huo, W. Xu, Z. Zhang, H. Wang, Z. Chen, and P. Zhao. Self-introspective decoding: Alleviating hallucinations for large vision-language models. InICLR, 2025

work page 2025

[13] [13]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y . Xu, E. Ishii, Y . J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation.ACM computing surveys, 2023

work page 2023

[15] [15]

Jiang, H

X. Jiang, H. Ye, Y . Zhu, X. Zheng, Z. Chen, and J. Gong. Hicd: Hallucination-inducing via attention dispersion for contrastive decoding to mitigate hallucinations in large language models. InACL, 2025

work page 2025

[16] [16]

J. Kim, J. Kim, Y . Kim, and S.-B. Cho. Fuzzy contrastive decoding to alleviate object hallucina- tion in large vision-language models. InICCV, 2025

work page 2025

[17] [17]

S. Kim, B. Cho, S. Bae, S. Ahn, and S.-Y . Yun. Vacode: Visual augmented contrastive decoding. arXiv preprint arXiv:2408.05337, 2024

work page arXiv 2024

[18] [18]

S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. InCVPR, 2024

work page 2024

[19] [19]

X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. B. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. InACL, 2023. 10

work page 2023

[20] [20]

Y . Li, Y . Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. InEMNLP, 2023

work page 2023

[21] [21]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

work page 2014

[22]

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.

[23]

F. Ma, X. Jin, H. Wang, Y. Xian, J. Feng, and Y. Yang. Vista-LLaMA: Reducing hallucination in video language models via equal distance to visual tokens. In CVPR, 2024.

[24]

Y. Park, D. Lee, J. Choe, and B. Chang. ConVis: Contrastive decoding with hallucination visualization for mitigating hallucinations in multimodal large language models. In AAAI, 2025.

[25]

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In EMNLP, 2018.

[26]

D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022.

[27]

Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L. Gui, Y.-X. Wang, Y. Yang, et al. Aligning large multimodal models with factually augmented RLHF. In ACL, 2024.

[28]

W. Suo, L. Zhang, M. Sun, L. Y. Wu, P. Wang, and Y. Zhang. Octopus: Alleviating hallucination via dynamic contrastive decoding. In CVPR, 2025.

[29]

Q. Team. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

[30]

S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In CVPR, 2024.

[31]

J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, J. Wang, H. Xu, M. Yan, J. Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.

[32]

X. Wang, J. Pan, L. Ding, and C. Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In ACL, 2024.

[33]

S. Woo, D. Kim, J. Jang, Y. Choi, and C. Kim. Don’t miss the forest for the trees: Attentional vision calibration for large vision language models. In ACL, 2025.

[34]

M. Wu, J. Ji, O. Huang, J. Li, Y. Wu, X. Sun, and R. Ji. Evaluating and analyzing relationship hallucinations in LVLMs. arXiv preprint arXiv:2406.16449, 4, 2024.

[35]

Y. Xia, S. Wang, and P. Li. SDCD: Structure-disrupted contrastive decoding for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2601.03500, 2026.

[36]

H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024.

[37]

Z. Yue, L. Zhang, and Q. Jin. Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In ACL, 2024.

[38]

K. Zhang, K. Tao, Z. Luo, C. Liu, J. Tang, and H. Wang. TARS: Minmax token-adaptive preference strategy for hallucination reduction in MLLMs. arXiv e-prints, 2025.

[39]

M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith. How language model hallucinations can snowball. In ICML, 2024.

[40]

Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In ICLR, 2024.

[41]

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.

Appendix Overview

This appendix provides additional details and analyses to complement the main paper. It is organized as follows:

• Section A. Social Impact. We discuss the ethical implications of the C...

POPE [20]: To mitigate the bias of simple "Yes" answers, POPE employs three progressively difficult sampling settings:

• Random: Negative objects are randomly sampled from those not present in the image.
• Popular: Negative objects are selected from the most frequent categories in the dataset (e.g., "person," "chair").
• Adversarial: Negative objects are sele...

AMBER [31]: AMBER spans both Discriminative (Yes/No questions based on POPE-style sampling) and Generative (Image Captioning) tasks to provide a multi-faceted assessment.

MME [8]: A comprehensive benchmark designed to evaluate both perception and cognition capabilities.

• The Perception track comprises 10 sub-tasks: Existence, Count, Color, Position, Celebrity, Landmark, Artwork, Poster, Movie, and Design, focusing on basic visual recognition.
• The Cognition track consists of the remaining 4 sub-tasks: Commonsense Reasoning, N...

MMHal-Bench [27]: This benchmark evaluates hallucinatory responses across 8 complex reasoning dimensions: Attribute, Comparison, Counting, Existence, Localization, Relation, Scene, and Sport.

D.2 Precise Metric Calculations

POPE Metrics: We report Accuracy (Acc) and F1-score (F1) to analyze the trade-off between sensitivity and specificity:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$
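The accuracy and F1 definitions in Eq. (8) can be checked with a few lines of Python. This is only a minimal sketch: the function name `pope_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not part of the paper's evaluation code.

```python
def pope_metrics(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, F1) from confusion-matrix counts, following Eq. (8).

    Hypothetical helper; zero denominators return 0.0 by assumption.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return acc, f1


# Example with made-up counts: 40 TP, 45 TN, 5 FP, 10 FN.
acc, f1 = pope_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"Acc = {acc:.4f}, F1 = {f1:.4f}")
```

Writing F1 directly in terms of TP, FP, and FN avoids computing precision and recall separately, so the result stays defined even when one of them would divide by zero.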


AMBER Score Calculation: The final AMBER score is computed as the arithmetic mean of the discriminative F1-score and the generative fidelity (represented by $100 - \mathrm{CHAIR}_i$), providing a balanced metric for both task types:

$$\mathrm{Score}_{\mathrm{AMBER}} = \frac{(100 - \mathrm{CHAIR}_i) + \mathrm{F1}}{2} \tag{9}$$

where $\mathrm{CHAIR}_i$ denotes the instance-level hallucination rate in generative tasks and $\mathrm{F1}_{\mathrm{pope}}$ is the F1-s...
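As a worked instance of Eq. (9), the sketch below averages the generative fidelity term and the discriminative F1. It assumes both inputs are percentages on a 0 to 100 scale, which is what the `100 - CHAIR_i` term implies; the function name `amber_score` and the sample values are illustrative only and are not results from the paper.

```python
def amber_score(chair_i: float, f1: float) -> float:
    """Arithmetic mean of (100 - CHAIR_i) and F1, following Eq. (9).

    Assumes both arguments are percentages in [0, 100] (an assumption
    made here, inferred from the form of the equation).
    """
    return ((100.0 - chair_i) + f1) / 2.0


# Illustrative values only: CHAIR_i = 8.0, F1 = 86.0.
print(amber_score(chair_i=8.0, f1=86.0))
```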


MME Scoring: For each category $c$, the score is the sum of raw accuracy ($\mathrm{Acc}_c$) and balanced accuracy ($\mathrm{Acc}^{+}_c$). Let $N_c$ be the number of image-question pairs in category $c$:

$$\mathrm{Acc}_c = \frac{\sum \text{Correct Responses}}{N_c}, \qquad \mathrm{Score}_{\mathrm{MME}} = \sum_{c=1}^{14} \left(\mathrm{Acc}_c \times 100 + \mathrm{Acc}^{+}_c \times 100\right) \tag{10}$$

The total maximum score for the Perception track is 2000.
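The per-category sum in Eq. (10) can be sketched as follows. The text above does not spell out how $\mathrm{Acc}^{+}_c$ is computed, so this sketch assumes the usual MME convention that an image counts toward $\mathrm{Acc}^{+}$ only when every question asked about it is answered correctly. The record layout `(image_id, is_correct)` and both function names are hypothetical.

```python
from collections import defaultdict


def mme_category_score(records: list[tuple[str, bool]]) -> float:
    """Score one category as Acc_c * 100 + Acc+_c * 100, following Eq. (10).

    records: (image_id, is_correct) pairs, one per image-question pair.
    Acc+ is computed per image (all of its questions correct); this is an
    assumption based on the MME convention, not stated in the text above.
    """
    if not records:
        return 0.0
    acc = sum(correct for _, correct in records) / len(records)
    per_image = defaultdict(list)
    for image_id, correct in records:
        per_image[image_id].append(correct)
    acc_plus = sum(all(answers) for answers in per_image.values()) / len(per_image)
    return acc * 100 + acc_plus * 100


def mme_total(by_category: dict[str, list[tuple[str, bool]]]) -> float:
    """Sum the per-category scores over all categories supplied."""
    return sum(mme_category_score(recs) for recs in by_category.values())


# Made-up records: image "a" has both answers right, image "b" only one.
existence = [("a", True), ("a", True), ("b", True), ("b", False)]
print(mme_category_score(existence))
```

Under this convention each category is capped at 200 points, which agrees with the stated maximum of 2000 for the 10 Perception sub-tasks.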


MMHal-Bench Scoring: The final informative-hallucination score ($S_{\mathrm{total}}$) is the average across all 96 samples evaluated by an LLM-based judge:

$$S_{\mathrm{total}} = \frac{1}{96} \sum_{j=1}^{96} \text{GPT-4-Judge}(\mathrm{Response}_j) \tag{11}$$

E More results

E.1 MME-Fullset

Detailed subtask evaluations on the MME benchmark are presented here. Figures 6 and 7 illustrate the performance profiles for LLaVA-1....