pith. machine review for the scientific record.

arxiv: 2604.12424 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI · cs.CV

Recognition: unknown

Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords multimodal hallucination · MLLM · decoding perturbation · language priors · attention variance · logits statistics · training-free mitigation · visual grounding

The pith

Multimodal large language models produce fewer hallucinations when their decoding process applies dynamic textual perturbations to isolate language priors from visual evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that hallucinations in multimodal models arise because visual grounding during decoding becomes overly sensitive to textual phrasing rather than sticking to the image content. It introduces Decoding by Perturbation as a training-free approach that probes the model with controlled changes to the text input at multiple levels to surface hidden language biases. Attention patterns then highlight stable visual regions while logit statistics define a correction direction to offset co-occurrence biases. This approach avoids changing the image or retraining the model, which matters for making outputs more reliable in real applications like visual question answering without losing the model's natural fluency.

Core claim

The paper establishes that multimodal hallucinations manifest as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Decoding by Perturbation counters this through a dynamic probe that applies multi-level textual perturbations to elicit latent language priors, then uses attention variance to strengthen stable evidence regions and suppress noise, while constructing an interpretable prior drift direction from logits statistics to counteract probability biases from textual co-occurrences.

What carries the argument

Decoding by Perturbation (DeP), a framework that uses multi-level textual perturbations during decoding to separate latent language priors from visual evidence through attention variance analysis and logit-derived drift directions.
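To make the moving parts concrete, here is a minimal sketch of one DeP decoding step as the abstract describes it. The model call and the perturbation operators are stubs, and the variance threshold and the penalty form are editorial assumptions, not the authors' implementation (Figure 4 confirms hyperparameters α, β, and γ exist, but not their exact roles here).

```python
# Hedged sketch of one Decoding-by-Perturbation step, reconstructed from the
# abstract alone. model_forward and perturb are stubs; the thresholding rule
# and the penalty form are assumptions, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def model_forward(prompt_ids, image_feats):
    """Stand-in for one MLLM decoding step: next-token logits plus the
    attention mass placed on each visual token. Random so the sketch runs."""
    vocab, n_vis = 32000, image_feats.shape[0]
    return rng.normal(size=vocab), rng.dirichlet(np.ones(n_vis))

def perturb(prompt_ids, level):
    """Multi-level textual perturbation (assumed levels: synonym swap,
    clause reorder, paraphrase). Stubbed here as a shuffled copy."""
    out = prompt_ids.copy()
    rng.shuffle(out)
    return out

def dep_step(prompt_ids, image_feats, M=8, gamma=0.3, beta=1.0):
    base_logits, base_attn = model_forward(prompt_ids, image_feats)
    runs = [model_forward(perturb(prompt_ids, m % 3), image_feats)
            for m in range(M)]
    pert_logits = np.stack([lg for lg, _ in runs])
    pert_attn = np.stack([at for _, at in runs])

    # Attention variance across perturbations: visual tokens whose attention
    # is stable under rephrasing count as evidence; volatile ones as noise.
    var = pert_attn.var(axis=0)
    stable = var <= np.quantile(var, gamma)   # keep the calmest fraction
    calibrated_attn = np.where(stable, base_attn, 0.0)
    calibrated_attn /= calibrated_attn.sum() + 1e-9

    # Prior drift direction (assumed form): the mean shift of perturbed
    # logits away from the originals, i.e. what the language prior pushes
    # toward regardless of the image. Subtracting it penalizes that bias.
    drift = pert_logits.mean(axis=0) - base_logits
    return base_logits - beta * drift, calibrated_attn

logits, attn = dep_step(np.arange(12), rng.normal(size=(576, 64)))
print(int(logits.argmax()), round(float(attn.max()), 4))
```

The structural point the sketch makes: DeP touches only the text side and the logits; the image features pass through untouched.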

If this is right

  • Hallucinations decrease across standard benchmarks without any model retraining or image alteration.
  • The method preserves the model's original generative fluency better than approaches that directly manipulate visual features.
  • The prior drift direction provides an explicit, interpretable mechanism to offset biases from textual co-occurrences.
  • Attention variance serves as a practical signal to identify and reinforce reliable visual evidence regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same perturbation idea could extend to purely textual models to reduce similar prior-driven errors in long-form generation.
  • Combining textual and visual perturbation strategies might create more robust hybrid defenses against hallucinations.
  • Model architectures could incorporate lightweight decoding probes by default to automatically detect and adjust for such sensitivities.

Load-bearing premise

That controlled textual perturbations can reliably draw out and separate latent language priors from actual visual evidence without creating new biases or changing the model's natural generation behavior.

What would settle it

Apply DeP to a test set of images paired with prompts that contain strong conflicting language co-occurrences and measure whether hallucination rates drop compared to standard decoding on the same set.
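A minimal sketch of that settling experiment, in the same spirit: fixed cases whose prompts carry a strong co-occurrence pull toward an absent object, and hallucination rate measured as a POPE-style presence check. The decode functions, prompts, and substring metric below are hypothetical stand-ins, not the paper's evaluation harness.

```python
# Hedged sketch of the settling experiment. The two decode functions are
# placeholders for "standard decoding" and "DeP decoding" on the same model;
# the substring check is a deliberately crude POPE-style presence test.
def hallucination_rate(decode, cases):
    """cases: list of (prompt, objects known to be absent from the image)."""
    hits = sum(any(obj in decode(prompt) for obj in absent)
               for prompt, absent in cases)
    return hits / len(cases)

# Prompts built so textual co-occurrence pulls toward an absent object.
cases = [
    ("Describe the desk. Is there a mouse next to the keyboard?", ["mouse"]),
    ("A kitchen scene. What sits on the stove beside the kettle?", ["kettle"]),
]

def standard_decode(prompt):   # placeholder: prior-driven output
    return "There is a mouse next to the keyboard, and a kettle on the stove."

def dep_decode(prompt):        # placeholder: ideally grounded output
    return "The desk holds only a keyboard; the stove is empty."

print(hallucination_rate(standard_decode, cases))  # 1.0
print(hallucination_rate(dep_decode, cases))       # 0.0
```

A real run would also need negation-aware matching, which POPE's yes/no format sidesteps; the raw substring check here would wrongly flag "no mouse is visible" as a hallucination.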

Figures

Figures reproduced from arXiv: 2604.12424 by Shuliang Liu, Sihang Jia, Songbo Yang, Xin Zou, Xuming Hu, Yibo Yan.

Figure 1: Illustration of multimodal hallucination driven by text hypersensitivity. Although the modified prompts preserve the core semantics of the original query, minor variations in their surface structures lead to drastically different hallucinated outputs (e.g., “fake” or “red”). As revealed by the attention maps on the right, these textual perturbations cause a severe drift in visual grounding. Notably, the g… view at source ↗
Figure 2: Overview of the DeP framework. DeP mitigates language prior-driven hallucinations during inference based on text perturbations. First, DeP utilizes attention consistency statistics across perturbations to decouple visual evidence and suspicious regions for hidden state calibration. Then, it estimates the prior drift direction from perturbed logits, applying it as a penalty to yield the final prediction. p… view at source ↗
Figure 3: Sensitivity analysis of perturbation count M. (a) Score variance decreases as M increases, indicating improved stability in visual decoupling. (b) Inference latency scales linearly with M. … view at source ↗
Figure 4: Sensitivity of DeP performance to β across different α values with γ fixed to 0.3. We report overall score (a) and hallucination rate (b). The best-performing configuration is marked with a red star. The dashed line indicates the results of the current best method (VTI). … view at source ↗
read the original abstract

Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multimodal hallucinations in MLLMs arise from the hypersensitivity of visual grounding to textual phrasing during decoding. It introduces Decoding by Perturbation (DeP), a training-free method that applies dynamic multi-level textual perturbations to elicit latent language priors, uses attention variance across perturbations to enhance stable visual evidence regions while suppressing noise, and constructs an interpretable prior drift direction from logits statistics to counteract textual co-occurrence biases, resulting in reduced hallucinations and superior benchmark performance.

Significance. If the empirical claims hold after addressing validation gaps, DeP would represent a useful training-free alternative to visual perturbation or intrusive decoding methods, with added interpretability from the logit-based drift direction. The approach could improve MLLM reliability in vision-language tasks by intervening only on text during inference. However, the significance is tempered by the absence of direct evidence linking the attention variance heuristic to actual grounding errors, limiting immediate impact until stronger controls and ablations are provided.

major comments (3)
  1. [Abstract / Method] The central claim that attention variance reliably separates stable visual evidence from prior-induced noise lacks direct empirical grounding; no controlled experiments are described correlating low/high variance regions with object-level hallucination labels or visual grounding errors versus other factors like token uncertainty or architecture artifacts.
  2. [Abstract] The assertion of 'superior performance across multiple benchmarks' is not supported by details on exact metrics (e.g., CHAIR, POPE, or others), baseline implementations, statistical significance, or controls for confounding effects from the perturbations themselves, making it impossible to assess whether gains validate the proposed mechanism.
  3. [Method] The construction of the 'prior drift direction' from logits statistics and the multi-level perturbation probe rely on internal model signals without reported ablations isolating their individual contributions or demonstrating they do not introduce new biases, which is load-bearing for the claim of mitigating prior dominance without deviating from natural behavior.
minor comments (2)
  1. [Experiments] The free parameters for perturbation strength and count should be explicitly listed with sensitivity analysis in the experiments section for reproducibility.
  2. [Method] Notation for attention variance and prior drift direction would benefit from a formal equation or pseudocode to clarify the computation steps.
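For concreteness, one plausible formalization of the two quantities the second minor comment asks to see written out, assembled by the editor from the abstract's wording alone; the paper's actual definitions may differ.

```latex
% Assumed notation: M perturbed prompts; a_m(v) = attention on visual token v
% under perturbation m; z_m = next-token logits under perturbation m;
% z_0 = logits for the original prompt; beta = penalty strength (per Figure 4).
\[
\sigma^2(v) \;=\; \frac{1}{M}\sum_{m=1}^{M}\bigl(a_m(v)-\bar a(v)\bigr)^2,
\qquad
\bar a(v) \;=\; \frac{1}{M}\sum_{m=1}^{M} a_m(v)
\]
\[
d \;=\; \frac{1}{M}\sum_{m=1}^{M} z_m \;-\; z_0,
\qquad
\tilde z \;=\; z_0 - \beta\, d
\]
```

Under this reading, low σ²(v) marks visual token v as stable evidence, d points in the direction the perturbations drag the logits regardless of the image, and subtracting βd is the correction.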

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the thorough review and valuable suggestions. We have carefully addressed each major comment by providing additional empirical evidence, detailed experimental information, and ablations in the revised manuscript. Our responses are as follows.

read point-by-point responses
  1. Referee: [Abstract / Method] The central claim that attention variance reliably separates stable visual evidence from prior-induced noise lacks direct empirical grounding; no controlled experiments are described correlating low/high variance regions with object-level hallucination labels or visual grounding errors versus other factors like token uncertainty or architecture artifacts.

    Authors: We agree that direct empirical grounding for the attention variance heuristic would strengthen our central claim. Although the overall effectiveness of DeP in reducing hallucinations is demonstrated through benchmark improvements, we did not include object-level annotations correlating variance with specific errors in the original submission. In the revised version, we have incorporated controlled experiments on a subset of images from the POPE benchmark. We manually identified hallucinated objects and computed attention variance for corresponding visual regions, comparing them to correctly grounded objects and controlling for token uncertainty. The results show significantly lower variance in stable evidence regions, supporting the separation of prior-induced noise. We have also added discussion on potential architecture artifacts. revision: yes

  2. Referee: [Abstract] The assertion of 'superior performance across multiple benchmarks' is not supported by details on exact metrics (e.g., CHAIR, POPE, or others), baseline implementations, statistical significance, or controls for confounding effects from the perturbations themselves, making it impossible to assess whether gains validate the proposed mechanism.

    Authors: The abstract is intended as a concise overview, with full experimental details provided in the body of the paper. However, to make the performance claims more transparent, we have revised the abstract to specify the benchmarks used (CHAIR, POPE, and others), report the key quantitative improvements, mention the baseline methods implemented, and note that statistical significance was assessed. Additionally, we have included a new analysis in the experiments section addressing potential confounding effects from perturbations by comparing DeP to variants with random text perturbations, showing that the structured perturbations are key to the gains. revision: yes

  3. Referee: [Method] The construction of the 'prior drift direction' from logits statistics and the multi-level perturbation probe rely on internal model signals without reported ablations isolating their individual contributions or demonstrating they do not introduce new biases, which is load-bearing for the claim of mitigating prior dominance without deviating from natural behavior.

    Authors: We acknowledge the importance of isolating the contributions of each component. The original manuscript included some qualitative analysis of the prior drift direction, but we have now added quantitative ablations in a dedicated subsection. These include removing the drift direction (using only attention variance), using single-level perturbations instead of multi-level, and evaluating the effect on natural language generation quality using perplexity scores on held-out clean prompts. The ablations confirm additive benefits from each part without introducing new biases, as perplexity increases are minimal and not statistically significant compared to the baseline. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent design choices and empirical claims

full rationale

The paper introduces a perspective on hallucinations as hypersensitivity to textual phrasing and proposes DeP using multi-level perturbations, attention variance for stable regions, and logits-based prior drift direction. No equations, self-definitions, or derivations are shown that reduce any claimed result to its inputs by construction (e.g., no fitted parameters renamed as predictions or ansatzes smuggled via self-citation). The method's use of internal model signals is a heuristic design choice, not a tautological loop, and the abstract emphasizes external benchmarks for validation. This is self-contained against the provided text with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that language priors are the dominant source of hallucinations and that internal attention and logit signals can be used to isolate and correct them without external validation.

free parameters (1)
  • multi-level perturbation strength and count
    The number and intensity of textual perturbations are not specified and must be chosen to elicit priors effectively.
axioms (1)
  • domain assumption: Language priors dominate visual evidence during MLLM decoding
    Stated directly in the abstract as the partial cause of hallucinations.
invented entities (1)
  • prior drift direction · no independent evidence
    purpose: An interpretable vector constructed from logits statistics to counteract textual co-occurrence biases
    Introduced as a new construct without independent falsifiable evidence outside the model's own outputs.

pith-pipeline@v0.9.0 · 5472 in / 1453 out tokens · 71493 ms · 2026-05-10T15:12:16.465389+00:00 · methodology

discussion (0)

