MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Kun Zhou; Qixin Xiao

arxiv: 2602.01740 · v3 · pith:2CUSZP2Ynew · submitted 2026-02-02 · 💻 cs.AI · cs.CV · cs.LG

classification: cs.AI, cs.CV, cs.LG

keywords: contrastive, counterfactual, decoding, hallucination, macd, data, generating, inputs

Abstract

Video language models (Video-LLMs) are prone to hallucinations, generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for hallucination mitigation, but often fail to target the visual cues that drive hallucination or align with model weaknesses. We propose Model-Aware Counterfactual Data based Contrastive Decoding (MACD), an inference strategy that combines model-guided counterfactual construction with contrastive decoding. MACD uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted object-level counterfactual inputs rather than arbitrary frame or temporal modifications. These counterfactual inputs are integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test, and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL, with especially strong gains in scenarios involving small, occluded, or co-occurring objects.
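
To make the contrastive decoding step concrete, here is a minimal sketch of how next-token scores from the original input and from a counterfactual input might be combined. It assumes the formulation commonly used in visual contrastive decoding (amplify the original logits, subtract the counterfactual logits, and restrict the choice to tokens the original distribution already finds plausible). The abstract does not state MACD's exact scoring rule, so the function names, `alpha`, `beta`, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_scores(orig_logits, cf_logits, alpha=1.0, beta=0.1):
    """Combine logits from the original and the counterfactual input.

    Assumed rule (not confirmed by the abstract):
        score = (1 + alpha) * orig - alpha * counterfactual
    Tokens whose original probability falls below beta * max probability
    are masked out, so the contrast cannot promote implausible tokens.
    """
    probs = softmax(orig_logits)
    cutoff = beta * max(probs)
    scores = []
    for lo, lc, p in zip(orig_logits, cf_logits, probs):
        if p < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1 + alpha) * lo - alpha * lc)
    return scores


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary: index 0 is a grounded token, index 1 a hallucinated
# token driven by language priors, index 2 an implausible token.
orig = [2.0, 2.2, -3.0]  # logits given the original video
cf = [0.5, 2.1, -3.0]    # logits given the object-level counterfactual

scores = contrastive_scores(orig, cf)
print(greedy_pick(orig), greedy_pick(scores))
```

In this toy case the hallucinated token keeps nearly the same logit when the relevant object is removed, so the subtraction penalizes it, while the grounded token loses support in the counterfactual and is boosted by the contrast.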

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
cs.CL · 2026-04 · unverdicted · novelty 7.0

DeP mitigates MLLM hallucinations by dynamically perturbing text prompts to identify and reinforce stable visual evidence regions while counteracting language prior biases using attention variance and logit statistics.