pith. machine review for the scientific record.

arxiv: 2604.12424 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI · cs.CV

Recognition: unknown

Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords multimodal hallucination · MLLM · decoding perturbation · language priors · attention variance · logits statistics · training-free mitigation · visual grounding

The pith

Multimodal large language models produce fewer hallucinations when their decoding process applies dynamic textual perturbations to isolate language priors from visual evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that hallucinations in multimodal models arise because visual grounding during decoding becomes overly sensitive to textual phrasing rather than sticking to the image content. It introduces Decoding by Perturbation as a training-free approach that probes the model with controlled changes to the text input at multiple levels to surface hidden language biases. Attention patterns then highlight stable visual regions while logit statistics define a correction direction to offset co-occurrence biases. This approach avoids changing the image or retraining the model, which matters for making outputs more reliable in real applications like visual question answering without losing the model's natural fluency.

Core claim

The paper establishes that multimodal hallucinations manifest as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Decoding by Perturbation counters this through a dynamic probe that applies multi-level textual perturbations to elicit latent language priors, then uses attention variance to strengthen stable evidence regions and suppress noise, while constructing an interpretable prior drift direction from logits statistics to counteract probability biases from textual co-occurrences.

What carries the argument

Decoding by Perturbation (DeP), a framework that uses multi-level textual perturbations during decoding to separate latent language priors from visual evidence through attention variance analysis and logit-derived drift directions.
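To make the moving parts concrete, here is a minimal sketch of one DeP decoding step as the abstract describes it. The model call and the perturbation operators are stubs, and the variance threshold and the penalty form are editorial assumptions, not the authors' implementation (Figure 4 confirms hyperparameters α, β, and γ exist, but not their exact roles here).

```python
# Hedged sketch of one Decoding-by-Perturbation step, reconstructed from the
# abstract alone. model_forward and perturb are stubs; the thresholding rule
# and the penalty form are assumptions, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def model_forward(prompt_ids, image_feats):
    """Stand-in for one MLLM decoding step: next-token logits plus the
    attention mass placed on each visual token. Random so the sketch runs."""
    vocab, n_vis = 32000, image_feats.shape[0]
    return rng.normal(size=vocab), rng.dirichlet(np.ones(n_vis))

def perturb(prompt_ids, level):
    """Multi-level textual perturbation (assumed levels: synonym swap,
    clause reorder, paraphrase). Stubbed here as a shuffled copy."""
    out = prompt_ids.copy()
    rng.shuffle(out)
    return out

def dep_step(prompt_ids, image_feats, M=8, gamma=0.3, beta=1.0):
    base_logits, base_attn = model_forward(prompt_ids, image_feats)
    runs = [model_forward(perturb(prompt_ids, m % 3), image_feats)
            for m in range(M)]
    pert_logits = np.stack([lg for lg, _ in runs])
    pert_attn = np.stack([at for _, at in runs])

    # Attention variance across perturbations: visual tokens whose attention
    # is stable under rephrasing count as evidence; volatile ones as noise.
    var = pert_attn.var(axis=0)
    stable = var <= np.quantile(var, gamma)   # keep the calmest fraction
    calibrated_attn = np.where(stable, base_attn, 0.0)
    calibrated_attn /= calibrated_attn.sum() + 1e-9

    # Prior drift direction (assumed form): the mean shift of perturbed
    # logits away from the originals, i.e. what the language prior pushes
    # toward regardless of the image. Subtracting it penalizes that bias.
    drift = pert_logits.mean(axis=0) - base_logits
    return base_logits - beta * drift, calibrated_attn

logits, attn = dep_step(np.arange(12), rng.normal(size=(576, 64)))
print(int(logits.argmax()), round(float(attn.max()), 4))
```

The structural point the sketch makes: DeP touches only the text side and the logits; the image features pass through untouched.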

If this is right

  • Hallucinations decrease across standard benchmarks without any model retraining or image alteration.
  • The method preserves the model's original generative fluency better than approaches that directly manipulate visual features.
  • The prior drift direction provides an explicit, interpretable mechanism to offset biases from textual co-occurrences.
  • Attention variance serves as a practical signal to identify and reinforce reliable visual evidence regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same perturbation idea could extend to purely textual models to reduce similar prior-driven errors in long-form generation.
  • Combining textual and visual perturbation strategies might create more robust hybrid defenses against hallucinations.
  • Model architectures could incorporate lightweight decoding probes by default to automatically detect and adjust for such sensitivities.

Load-bearing premise

That controlled textual perturbations can reliably draw out and separate latent language priors from actual visual evidence without creating new biases or changing the model's natural generation behavior.

What would settle it

Apply DeP to a test set of images paired with prompts that contain strong conflicting language co-occurrences and measure whether hallucination rates drop compared to standard decoding on the same set.
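A minimal sketch of that settling experiment, in the same spirit: fixed cases whose prompts carry a strong co-occurrence pull toward an absent object, and hallucination rate measured as a POPE-style presence check. The decode functions, prompts, and substring metric below are hypothetical stand-ins, not the paper's evaluation harness.

```python
# Hedged sketch of the settling experiment. The two decode functions are
# placeholders for "standard decoding" and "DeP decoding" on the same model;
# the substring check is a deliberately crude POPE-style presence test.
def hallucination_rate(decode, cases):
    """cases: list of (prompt, objects known to be absent from the image)."""
    hits = sum(any(obj in decode(prompt) for obj in absent)
               for prompt, absent in cases)
    return hits / len(cases)

# Prompts built so textual co-occurrence pulls toward an absent object.
cases = [
    ("Describe the desk. Is there a mouse next to the keyboard?", ["mouse"]),
    ("A kitchen scene. What sits on the stove beside the kettle?", ["kettle"]),
]

def standard_decode(prompt):   # placeholder: prior-driven output
    return "There is a mouse next to the keyboard, and a kettle on the stove."

def dep_decode(prompt):        # placeholder: ideally grounded output
    return "The desk holds only a keyboard; the stove is empty."

print(hallucination_rate(standard_decode, cases))  # 1.0
print(hallucination_rate(dep_decode, cases))       # 0.0
```

A real run would also need negation-aware matching, which POPE's yes/no format sidesteps; the raw substring check here would wrongly flag "no mouse is visible" as a hallucination.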

Figures

Figures reproduced from arXiv: 2604.12424 by Shuliang Liu, Sihang Jia, Songbo Yang, Xin Zou, Xuming Hu, Yibo Yan.

Figure 1: Illustration of multimodal hallucination driven by text hypersensitivity. Although the modified prompts preserve the core semantics of the original query, minor variations in their surface structures lead to drastically different hallucinated outputs (e.g., “fake” or “red”). As revealed by the attention maps on the right, these textual perturbations cause a severe drift in visual grounding. Notably, the g… view at source ↗
Figure 2: Overview of the DeP framework. DeP mitigates language prior-driven hallucinations during inference based on text perturbations. First, DeP utilizes attention consistency statistics across perturbations to decouple visual evidence and suspicious regions for hidden state calibration. Then, it estimates the prior drift direction from perturbed logits, applying it as a penalty to yield the final prediction. p… view at source ↗
Figure 3: Sensitivity analysis of perturbation count M. (a) Score variance decreases as M increases, indicating improved stability in visual decoupling. (b) Inference latency scales linearly with M. … view at source ↗
Figure 4: Sensitivity of DeP performance to β across different α values with γ fixed to 0.3. We report overall score (a) and hallucination rate (b). The best-performing configuration is marked with a red star. The dashed line indicates the results of the current best method (VTI). … view at source ↗
read the original abstract

Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multimodal hallucinations in MLLMs arise from the hypersensitivity of visual grounding to textual phrasing during decoding. It introduces Decoding by Perturbation (DeP), a training-free method that applies dynamic multi-level textual perturbations to elicit latent language priors, uses attention variance across perturbations to enhance stable visual evidence regions while suppressing noise, and constructs an interpretable prior drift direction from logits statistics to counteract textual co-occurrence biases, resulting in reduced hallucinations and superior benchmark performance.

Significance. If the empirical claims hold after addressing validation gaps, DeP would represent a useful training-free alternative to visual perturbation or intrusive decoding methods, with added interpretability from the logit-based drift direction. The approach could improve MLLM reliability in vision-language tasks by intervening only on text during inference. However, the significance is tempered by the absence of direct evidence linking the attention variance heuristic to actual grounding errors, limiting immediate impact until stronger controls and ablations are provided.

major comments (3)
  1. [Abstract / Method] The central claim that attention variance reliably separates stable visual evidence from prior-induced noise lacks direct empirical grounding; no controlled experiments are described correlating low/high variance regions with object-level hallucination labels or visual grounding errors versus other factors like token uncertainty or architecture artifacts.
  2. [Abstract] The assertion of 'superior performance across multiple benchmarks' is not supported by details on exact metrics (e.g., CHAIR, POPE, or others), baseline implementations, statistical significance, or controls for confounding effects from the perturbations themselves, making it impossible to assess whether gains validate the proposed mechanism.
  3. [Method] The construction of the 'prior drift direction' from logits statistics and the multi-level perturbation probe rely on internal model signals without reported ablations isolating their individual contributions or demonstrating they do not introduce new biases, which is load-bearing for the claim of mitigating prior dominance without deviating from natural behavior.
minor comments (2)
  1. [Experiments] The free parameters for perturbation strength and count should be explicitly listed with sensitivity analysis in the experiments section for reproducibility.
  2. [Method] Notation for attention variance and prior drift direction would benefit from a formal equation or pseudocode to clarify the computation steps.
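For concreteness, one plausible formalization of the two quantities the second minor comment asks to see written out, assembled by the editor from the abstract's wording alone; the paper's actual definitions may differ.

```latex
% Assumed notation: M perturbed prompts; a_m(v) = attention on visual token v
% under perturbation m; z_m = next-token logits under perturbation m;
% z_0 = logits for the original prompt; beta = penalty strength (per Figure 4).
\[
\sigma^2(v) \;=\; \frac{1}{M}\sum_{m=1}^{M}\bigl(a_m(v)-\bar a(v)\bigr)^2,
\qquad
\bar a(v) \;=\; \frac{1}{M}\sum_{m=1}^{M} a_m(v)
\]
\[
d \;=\; \frac{1}{M}\sum_{m=1}^{M} z_m \;-\; z_0,
\qquad
\tilde z \;=\; z_0 - \beta\, d
\]
```

Under this reading, low σ²(v) marks visual token v as stable evidence, d points in the direction the perturbations drag the logits regardless of the image, and subtracting βd is the correction.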

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the thorough review and valuable suggestions. We have carefully addressed each major comment by providing additional empirical evidence, detailed experimental information, and ablations in the revised manuscript. Our responses are as follows.

read point-by-point responses
  1. Referee: [Abstract / Method] The central claim that attention variance reliably separates stable visual evidence from prior-induced noise lacks direct empirical grounding; no controlled experiments are described correlating low/high variance regions with object-level hallucination labels or visual grounding errors versus other factors like token uncertainty or architecture artifacts.

    Authors: We agree that direct empirical grounding for the attention variance heuristic would strengthen our central claim. Although the overall effectiveness of DeP in reducing hallucinations is demonstrated through benchmark improvements, we did not include object-level annotations correlating variance with specific errors in the original submission. In the revised version, we have incorporated controlled experiments on a subset of images from the POPE benchmark. We manually identified hallucinated objects and computed attention variance for corresponding visual regions, comparing them to correctly grounded objects and controlling for token uncertainty. The results show significantly lower variance in stable evidence regions, supporting the separation of prior-induced noise. We have also added discussion on potential architecture artifacts. revision: yes

  2. Referee: [Abstract] The assertion of 'superior performance across multiple benchmarks' is not supported by details on exact metrics (e.g., CHAIR, POPE, or others), baseline implementations, statistical significance, or controls for confounding effects from the perturbations themselves, making it impossible to assess whether gains validate the proposed mechanism.

    Authors: The abstract is intended as a concise overview, with full experimental details provided in the body of the paper. However, to make the performance claims more transparent, we have revised the abstract to specify the benchmarks used (CHAIR, POPE, and others), report the key quantitative improvements, mention the baseline methods implemented, and note that statistical significance was assessed. Additionally, we have included a new analysis in the experiments section addressing potential confounding effects from perturbations by comparing DeP to variants with random text perturbations, showing that the structured perturbations are key to the gains. revision: yes

  3. Referee: [Method] The construction of the 'prior drift direction' from logits statistics and the multi-level perturbation probe rely on internal model signals without reported ablations isolating their individual contributions or demonstrating they do not introduce new biases, which is load-bearing for the claim of mitigating prior dominance without deviating from natural behavior.

    Authors: We acknowledge the importance of isolating the contributions of each component. The original manuscript included some qualitative analysis of the prior drift direction, but we have now added quantitative ablations in a dedicated subsection. These include removing the drift direction (using only attention variance), using single-level perturbations instead of multi-level, and evaluating the effect on natural language generation quality using perplexity scores on held-out clean prompts. The ablations confirm additive benefits from each part without introducing new biases, as perplexity increases are minimal and not statistically significant compared to the baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent design choices and empirical claims

full rationale

The paper introduces a perspective on hallucinations as hypersensitivity to textual phrasing and proposes DeP using multi-level perturbations, attention variance for stable regions, and logits-based prior drift direction. No equations, self-definitions, or derivations are shown that reduce any claimed result to its inputs by construction (e.g., no fitted parameters renamed as predictions or ansatzes smuggled via self-citation). The method's use of internal model signals is a heuristic design choice, not a tautological loop, and the abstract emphasizes external benchmarks for validation. This is self-contained against the provided text with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that language priors are the dominant source of hallucinations and that internal attention and logit signals can be used to isolate and correct them without external validation.

free parameters (1)
  • multi-level perturbation strength and count
    The number and intensity of textual perturbations are not specified and must be chosen to elicit priors effectively.
axioms (1)
  • domain assumption: Language priors dominate visual evidence during MLLM decoding
    Stated directly in the abstract as the partial cause of hallucinations.
invented entities (1)
  • prior drift direction · no independent evidence
    purpose: An interpretable vector constructed from logits statistics to counteract textual co-occurrence biases
    Introduced as a new construct without independent falsifiable evidence outside the model's own outputs.

pith-pipeline@v0.9.0 · 5472 in / 1453 out tokens · 71493 ms · 2026-05-10T15:12:16.465389+00:00 · methodology

discussion (0)

