Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3
The pith
Multimodal large language models produce fewer hallucinations when their decoding process applies dynamic textual perturbations to isolate language priors from visual evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that multimodal hallucinations manifest as hypersensitivity of visual grounding to textual phrasing during the decoding phase. Decoding by Perturbation counters this with a dynamic probe that applies multi-level textual perturbations to elicit latent language priors. It then uses attention variance to strengthen stable evidence regions and suppress noise, and constructs an interpretable prior drift direction from logit statistics to counteract probability biases arising from textual co-occurrences.
What carries the argument
Decoding by Perturbation (DeP), a framework that uses multi-level textual perturbations during decoding to separate latent language priors from visual evidence through attention variance analysis and logit-derived drift directions.
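The abstract gives the pipeline only in words. The sketch below is one minimal reading of it, with randomly generated stand-ins for model internals; every name, shape, and weighting rule here is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for model internals under K textual perturbations of one prompt:
# attention mass over R visual regions, next-token logits over a vocab of V.
K, R, V = 4, 16, 32
attn = rng.dirichlet(np.ones(R), size=K)    # (K, R): attention per perturbation
logits = rng.normal(size=(K, V))            # (K, V): logits per perturbation
base_logits = rng.normal(size=V)            # (V,): logits for the clean prompt

# 1) Stability: regions whose attention barely moves under perturbation are
#    read as visual evidence; high-variance regions as prior-driven noise.
var = attn.var(axis=0)                      # (R,) variance across perturbations
stability = 1.0 / (1.0 + var / var.mean())  # hypothetical down-weighting rule
evidence_weights = stability / stability.sum()

# 2) Prior drift: the mean logit shift induced by the perturbations is taken
#    as the direction in which language priors push the output distribution.
drift = logits.mean(axis=0) - base_logits   # (V,)

# 3) Corrected decoding: subtract a scaled drift before sampling.
alpha = 0.5                                 # hypothetical correction strength
corrected = base_logits - alpha * drift
probs = np.exp(corrected - corrected.max())
probs /= probs.sum()
```

In a real MLLM the attention and logits would come from repeated forward passes under the perturbed prompts; the point of the sketch is only the data flow from perturbation to variance weighting to drift correction.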
If this is right
- Hallucinations decrease across standard benchmarks without any model retraining or image alteration.
- The method preserves the model's original generative fluency better than approaches that directly manipulate visual features.
- The prior drift direction provides an explicit, interpretable mechanism to offset biases from textual co-occurrences.
- Attention variance serves as a practical signal to identify and reinforce reliable visual evidence regions.
Where Pith is reading between the lines
- The same perturbation idea could extend to purely textual models to reduce similar prior-driven errors in long-form generation.
- Combining textual and visual perturbation strategies might create more robust hybrid defenses against hallucinations.
- Model architectures could incorporate lightweight decoding probes by default to automatically detect and adjust for such sensitivities.
Load-bearing premise
That controlled textual perturbations can reliably draw out and separate latent language priors from actual visual evidence without creating new biases or changing the model's natural generation behavior.
What would settle it
Apply DeP to a test set of images paired with prompts that contain strong conflicting language co-occurrences and measure whether hallucination rates drop compared to standard decoding on the same set.
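A CHAIR-style instance rate is one concrete way to score such a test. The function below is a minimal sketch of that metric, not the benchmark's reference code; the example objects are hypothetical.

```python
def chair_i(generated_objects, image_objects):
    """CHAIR_i-style instance rate: fraction of mentioned objects
    that are absent from the ground-truth object set."""
    if not generated_objects:
        return 0.0
    present = set(image_objects)
    hallucinated = [o for o in generated_objects if o not in present]
    return len(hallucinated) / len(generated_objects)

# Hypothetical single-image comparison: the same prompt decoded with and
# without DeP, scored against ground-truth objects.
truth = {"dog", "person", "frisbee"}
baseline_rate = chair_i(["dog", "person", "frisbee", "bench"], truth)  # 0.25
dep_rate = chair_i(["dog", "person", "frisbee"], truth)                # 0.0
```

Averaging these rates over the conflicting-co-occurrence test set, for standard decoding versus DeP, would give the settling comparison directly.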
Original abstract
Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multimodal hallucinations in MLLMs arise from the hypersensitivity of visual grounding to textual phrasing during decoding. It introduces Decoding by Perturbation (DeP), a training-free method that applies dynamic multi-level textual perturbations to elicit latent language priors, uses attention variance across perturbations to enhance stable visual evidence regions while suppressing noise, and constructs an interpretable prior drift direction from logits statistics to counteract textual co-occurrence biases, resulting in reduced hallucinations and superior benchmark performance.
Significance. If the empirical claims hold after addressing validation gaps, DeP would represent a useful training-free alternative to visual perturbation or intrusive decoding methods, with added interpretability from the logit-based drift direction. The approach could improve MLLM reliability in vision-language tasks by intervening only on text during inference. However, the significance is tempered by the absence of direct evidence linking the attention variance heuristic to actual grounding errors, limiting immediate impact until stronger controls and ablations are provided.
major comments (3)
- [Abstract / Method] Abstract and method description: The central claim that attention variance reliably separates stable visual evidence from prior-induced noise lacks direct empirical grounding; no controlled experiments are described correlating low/high variance regions with object-level hallucination labels or visual grounding errors versus other factors like token uncertainty or architecture artifacts.
- [Abstract] Abstract: The assertion of 'superior performance across multiple benchmarks' is not supported by details on exact metrics (e.g., CHAIR, POPE, or others), baseline implementations, statistical significance, or controls for confounding effects from the perturbations themselves, making it impossible to assess whether gains validate the proposed mechanism.
- [Method] Method: The construction of the 'prior drift direction' from logits statistics and the multi-level perturbation probe rely on internal model signals without reported ablations isolating their individual contributions or demonstrating they do not introduce new biases, which is load-bearing for the claim of mitigating prior dominance without deviating from natural behavior.
minor comments (2)
- [Experiments] The free parameters for perturbation strength and count should be explicitly listed with sensitivity analysis in the experiments section for reproducibility.
- [Method] Notation for attention variance and prior drift direction would benefit from a formal equation or pseudocode to clarify the computation steps.
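One way the requested notation could be formalized (the symbols are assumed here, not taken from the paper):

```latex
% a_{k,r}: attention mass on visual region r under perturbation k = 1,\dots,K
% z_k, z_0: next-token logits under perturbation k and under the clean prompt
\operatorname{Var}(r) = \frac{1}{K}\sum_{k=1}^{K}\bigl(a_{k,r} - \bar{a}_r\bigr)^2,
\qquad \bar{a}_r = \frac{1}{K}\sum_{k=1}^{K} a_{k,r},
\qquad
d = \frac{1}{K}\sum_{k=1}^{K}\bigl(z_k - z_0\bigr)
```

Under this reading, low $\operatorname{Var}(r)$ marks a stable evidence region and $d$ is the prior drift direction subtracted (with some strength parameter) from the clean logits.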
Simulated Author's Rebuttal
We sincerely thank the referee for the thorough review and valuable suggestions. We have carefully addressed each major comment by providing additional empirical evidence, detailed experimental information, and ablations in the revised manuscript. Our responses are as follows.
Point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: The central claim that attention variance reliably separates stable visual evidence from prior-induced noise lacks direct empirical grounding; no controlled experiments are described correlating low/high variance regions with object-level hallucination labels or visual grounding errors versus other factors like token uncertainty or architecture artifacts.
Authors: We agree that direct empirical grounding for the attention variance heuristic would strengthen our central claim. Although the overall effectiveness of DeP in reducing hallucinations is demonstrated through benchmark improvements, we did not include object-level annotations correlating variance with specific errors in the original submission. In the revised version, we have incorporated controlled experiments on a subset of images from the POPE benchmark. We manually identified hallucinated objects and computed attention variance for corresponding visual regions, comparing them to correctly grounded objects and controlling for token uncertainty. The results show significantly lower variance in stable evidence regions, supporting the separation of prior-induced noise. We have also added discussion on potential architecture artifacts. revision: yes
-
Referee: [Abstract] Abstract: The assertion of 'superior performance across multiple benchmarks' is not supported by details on exact metrics (e.g., CHAIR, POPE, or others), baseline implementations, statistical significance, or controls for confounding effects from the perturbations themselves, making it impossible to assess whether gains validate the proposed mechanism.
Authors: The abstract is intended as a concise overview, with full experimental details provided in the body of the paper. However, to make the performance claims more transparent, we have revised the abstract to specify the benchmarks used (CHAIR, POPE, and others), report the key quantitative improvements, mention the baseline methods implemented, and note that statistical significance was assessed. Additionally, we have included a new analysis in the experiments section addressing potential confounding effects from perturbations by comparing DeP to variants with random text perturbations, showing that the structured perturbations are key to the gains. revision: yes
-
Referee: [Method] Method: The construction of the 'prior drift direction' from logits statistics and the multi-level perturbation probe rely on internal model signals without reported ablations isolating their individual contributions or demonstrating they do not introduce new biases, which is load-bearing for the claim of mitigating prior dominance without deviating from natural behavior.
Authors: We acknowledge the importance of isolating the contributions of each component. The original manuscript included some qualitative analysis of the prior drift direction, but we have now added quantitative ablations in a dedicated subsection. These include removing the drift direction (using only attention variance), using single-level perturbations instead of multi-level, and evaluating the effect on natural language generation quality using perplexity scores on held-out clean prompts. The ablations confirm additive benefits from each part without introducing new biases, as perplexity increases are minimal and not statistically significant compared to the baseline. revision: yes
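The fluency check the rebuttal describes reduces to corpus perplexity from per-token log-probabilities. The helper below is a minimal sketch of that computation, not the authors' evaluation code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log):
    exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A decoder assigning probability 0.5 to each of four tokens
# has perplexity exactly 2.
ppl = perplexity([math.log(0.5)] * 4)
```

Comparing this quantity on held-out clean prompts, with and without DeP active, is what "perplexity increases are minimal" would operationalize.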
Circularity Check
No significant circularity; derivation relies on independent design choices and empirical claims
full rationale
The paper introduces a perspective on hallucinations as hypersensitivity to textual phrasing and proposes DeP using multi-level perturbations, attention variance for stable regions, and logits-based prior drift direction. No equations, self-definitions, or derivations are shown that reduce any claimed result to its inputs by construction (e.g., no fitted parameters renamed as predictions or ansatzes smuggled via self-citation). The method's use of internal model signals is a heuristic design choice, not a tautological loop, and the abstract emphasizes external benchmarks for validation. This is self-contained against the provided text with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-level perturbation strength and count
axioms (1)
- domain assumption: language priors dominate visual evidence during MLLM decoding
invented entities (1)
- prior drift direction (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., Lu, S.: Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29915–29926 (2025)
2025
-
[2]
Hallucination of Multimodal Large Language Models: A Survey
Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930 (2024)
arXiv 2024
-
[3]
arXiv preprint arXiv:2506.21509 (2025)
Chen, J., He, J., Shao, Q., Chen, Q., Ying, J., Xu, H., Chen, J., Zheng, J., Wu, J.: Mitigating hallucination of large vision-language models via dynamic logits calibration. arXiv preprint arXiv:2506.21509 (2025)
-
[4]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Chen, J., Zhang, T., Huang, S., Niu, Y., Zhang, L., Wen, L., Hu, X.: Ict: Image-object cross-level trusted intervention for mitigating object hallucination in large vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4209–4221 (2025)
2025
-
[5]
In: European Conference on Computer Vision
Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. In: European Conference on Computer Vision. pp. 370–387. Springer (2024)
2024
-
[6]
arXiv preprint arXiv:2504.08809 (2025)
Chen, W., Yan, X., Wen, B., Yang, F., Gao, T., Zhang, D., Chen, L.: Decoupling contrastive decoding: Robust hallucination mitigation in multimodal large language models. arXiv preprint arXiv:2504.08809 (2025)
-
[7]
arXiv preprint (2025)
Chen, Y., Wang, P., Qin, G., Wu, W., Chen, M., Hao, Y.: Attention re-alignment in multimodal large language models via intermediate-layer guidance. arXiv preprint (2025)
2025
-
[8]
arXiv preprint arXiv:2505.17529 (2025)
Cho, Y., Kim, K., Hwang, T., Cho, S.: Do you keep an eye on what I ask? Mitigating multimodal hallucination via attention-guided ensemble decoding. arXiv preprint arXiv:2505.17529 (2025)
-
[9]
Advances in neural information processing systems 36, 49250–49267 (2023)
Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36, 49250–49267 (2023)
2023
-
[10]
A benchmark for ultra-high-resolution remote sensing mllms. arXiv preprint arXiv:2512.17319 (2025)
Dang, Y., Zhu, M., Wang, D., Zhang, Y., Yang, J., Fan, Q., Yang, Y., Li, W., Miao, F., Gao, Y.: A benchmark for ultra-high-resolution remote sensing mllms. arXiv preprint arXiv:2512.17319 (2025)
-
[11]
arXiv preprint arXiv:2504.05810 (2025)
Ding, X., Zhang, K., Han, J., Hong, L., Xu, H., Li, X.: Pami-vdpo: Mitigating video hallucinations by prompt-aware multi-instance video preference learning. arXiv preprint arXiv:2504.05810 (2025)
-
[12]
Truthprint: Mitigating lvlm object hallucination via latent truthful-guided pre-intervention
Duan, J., Kong, F., Cheng, H., Diffenderfer, J., Kailkhura, B., Sun, L., Zhu, X., Shi, X., Xu, K.: Truthprint: Mitigating lvlm object hallucination via latent truthful-guided pre-intervention. arXiv preprint arXiv:2503.10602 (2025)
-
[13]
arXiv preprint arXiv:2505.19678 (2025)
Fang, H., Zhou, C., Kong, J., Gao, K., Chen, B., Liang, T., Ma, G., Xia, S.T.: Grounding language with vision: A conditional mutual information calibrated decoding strategy for reducing hallucinations in lvlms. arXiv preprint arXiv:2505.19678 (2025)
-
[14]
Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296
Fu, C., Zhang, Y.F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., et al.: Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296 (2024)
-
[15]
LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Fu, J., Jiang, K., Hong, L., Li, J., Guo, H., Yang, D., Chen, Z., Zhang, W.: Lingoloop attack: Trapping mllms via linguistic context and state entrapment into endless loops. arXiv preprint arXiv:2506.14493 (2025)
arXiv 2025
-
[16]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)
2024
-
[17]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NA)
He, L., Chen, Z., Shi, Z., Yu, T., Sheng, L., Shao, J.: Systematic reward gap optimization for mitigating vlm hallucinations. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NA)
-
[18]
arXiv preprint arXiv:2205.02225 (2022)
Hu, X., Liu, S., Zhang, C., Li, S., Wen, L., Yu, P.S.: Hiure: Hierarchical exemplar contrastive learning for unsupervised relation extraction. arXiv preprint arXiv:2205.02225 (2022)
-
[19]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13418–13427 (2024)
2024
-
[20]
arXiv preprint arXiv:2509.21057 (2025)
Huo, J., Liu, S., Wang, B., Zhang, J., Yan, Y., Liu, A., Hu, X., Zhou, M.: Pmark: Towards robust and distortion-free semantic-level watermarking with channel constraints. arXiv preprint arXiv:2509.21057 (2025)
-
[21]
Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs
Jo, Y., Bae, S., Kim, T.: Attention-space contrastive guidance for efficient hallucination mitigation in lvlms. arXiv preprint arXiv:2601.13707 (2026)
arXiv 2026
-
[22]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13872–13882 (2024)
2024
-
[23]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Li, C., Im, E.W., Fazli, P.: Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13723–13733 (2025)
2025
-
[24]
arXiv preprint arXiv:2503.14895 (2025)
Li, S., Sun, J., Zheng, G., Fan, X., Shen, Y., Lu, Y., Xi, Z., Yang, Y., Tan, W., Ji, T., et al.: Mitigating object hallucinations in mllms via multi-frequency perturbations. arXiv preprint arXiv:2503.14895 (2025)
-
[25]
In: Proceedings of the 2023 conference on empirical methods in natural language processing
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 conference on empirical methods in natural language processing. pp. 292–305 (2023)
2023
-
[26]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)
2024
-
[27]
Advances in neural information processing systems 36, 34892–34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)
2023
-
[28]
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023)
2023
-
[29]
In: The Thirteenth International Conference on Learning Representations (2025)
Liu, S., Ye, H., Zou, J.: Reducing hallucinations in large vision-language models via latent space steering. In: The Thirteenth International Conference on Learning Representations (2025)
2025
-
[30]
arXiv preprint arXiv:2601.05144 (2026)
Liu, S., Li, X., Liu, H., Yan, Y., Duan, B., Zheng, Q., Fang, D., Su, L., Hu, X.: Distilling the thought, watermarking the answer: A principle semantic guided watermark for large reasoning models. arXiv preprint arXiv:2601.05144 (2026)
-
[31]
arXiv preprint arXiv:2507.05288 (2025)
Liu, S., Liu, H., Liu, A., Duan, B., Zheng, Q., Yan, Y., Geng, H., Jiang, P., Liu, J., Hu, X.: A survey on proactive defense strategies against misinformation in large language models. arXiv preprint arXiv:2507.05288 (2025)
-
[32]
arXiv preprint arXiv:2601.05159 (2026)
Liu, S., Yang, S., Fang, D., Jia, S., Tang, Y., Su, L., Peng, R., Yan, Y., Zou, X., Hu, X.: Vision-language introspection: Mitigating overconfident hallucinations in mllms via interpretable bi-causal steering. arXiv preprint arXiv:2601.05159 (2026)
-
[33]
arXiv preprint arXiv:2507.14067 (2025)
Liu, S., Zheng, Q., Xu, J.J., Yan, Y., Geng, H., Liu, A., Jiang, P., Liu, J., Tam, Y.C., Hu, X.: Vla-mark: A cross modal watermark for large vision-language alignment model. arXiv preprint arXiv:2507.14067 (2025)
-
[34]
Advances in Neural Information Processing Systems 37, 122811–122832 (2024)
Lyu, X., Chen, B., Gao, L., Shen, H., Song, J.: Alleviating hallucinations in large vision-language models through hallucination-induced optimization. Advances in Neural Information Processing Systems 37, 122811–122832 (2024)
2024
-
[35]
Computers in Biology and Medicine 200, 111347 (2026)
Mahdavi, Z., Khodakaramimaghsoud, Z., Khaloo, H., Taleshani, S.B., Hashemi, E., Kaleybar, J.M., Manzari, O.N.: Med-vcd: Mitigating hallucination for medical large vision language models through visual contrastive decoding. Computers in Biology and Medicine 200, 111347 (2026)
2026
-
[36]
arXiv preprint arXiv:2410.13321 (2024)
Min, K., Kim, M., Lee, K.i., Lee, D., Jung, K.: Mitigating hallucinations in large vision-language models via summary-guided decoding. arXiv preprint arXiv:2410.13321 (2024)
-
[37]
arXiv preprint arXiv:2601.06224 (2026)
Pan, M., Gan, W., Chen, J., Zhang, W., Sun, B., Yin, J., Zhang, X.: Ground what you see: Hallucination-resistant mllms via caption feedback, diversity-aware sampling, and conflict regularization. arXiv preprint arXiv:2601.06224 (2026)
-
[38]
arXiv preprint arXiv:2505.21448 (2025)
Peng, Z., Liu, J., Zhang, H., Liu, X., Tang, S., Wan, P., Zhang, D., Liu, H., He, J.: Omnisync: Towards universal lip synchronization via diffusion transformers. arXiv preprint arXiv:2505.21448 (2025)
-
[39]
In: International Conference on Sustainable Computing and Intelligent Systems
Saxena, V., Sathe, A., Sandosh, S.: Mitigating hallucinations in large language models: A comprehensive survey on detection and reduction strategies. In: International Conference on Sustainable Computing and Intelligent Systems. pp. 39–52. Springer (2024)
2024
-
[40]
arXiv preprint arXiv:2601.23041 (2026)
Shi, Y., Yang, S., Liu, D.: One-shot optimized steering vector for hallucination mitigation for vlms. arXiv preprint arXiv:2601.23041 (2026)
-
[41]
Sivakumar, A.: Model Control through Lightweight Activation Steering for Vision Language Models. Ph.D. thesis, Virginia Tech (2025)
2025
-
[42]
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Su, J., Chen, J., Li, H., Chen, Y., Qing, L., Zhang, Z.: Activation steering decoding: Mitigating hallucination in large vision-language models through bidirectional hidden state intervention. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12964–12974 (2025)
2025
-
[43]
In: Ku, L.W., Martins, A., Srikumar, V
Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.X., Yang, Y., Keutzer, K., Darrell, T.: Aligning large multimodal models with factually augmented RLHF. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguistics: ACL 2024. pp. 13088–13110. Association for Computational Linguis...
2024
-
[44]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Tang, F., Liu, C., Xu, Z., Hu, M., Huang, Z., Xue, H., Chen, Z., Peng, Z., Yang, Z., Zhou, S., et al.: Seeing far and clearly: Mitigating hallucinations in mllms with attention causal decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26147–26159 (2025)
2025
-
[45]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Wang, S., Zhang, Y., Zhu, Y., Liu, E., Li, J., Liu, Y., Ji, X.: Shift: Smoothing hallucinations by information flow tuning for multimodal large language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3639–3649 (2025)
2025
-
[46]
arXiv preprint arXiv:2504.13169 (2025)
Wu, T.H., Lee, H., Ge, J., Gonzalez, J.E., Darrell, T., Chan, D.M.: Generate, but verify: Reducing hallucination in vision-language models with retrospective resampling. arXiv preprint arXiv:2504.13169 (2025)
-
[47]
Wu, Y., Zhang, L., Yao, H., Du, J., Yan, K., Ding, S., Wu, Y., Li, X.: Antidote: A unified framework for mitigating lvlm hallucinations in counterfactual presupposition and object perception. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14646–14656 (2025)
2025
-
[48]
arXiv preprint arXiv:2602.01740 (2026)
Xiao, Q., Zhou, K.: Macd: Model-aware contrastive decoding via counterfactual data. arXiv preprint arXiv:2602.01740 (2026)
-
[49]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Xiao, W., Huang, Z., Gan, L., He, W., Li, H., Yu, Z., Shu, F., Jiang, H., Zhu, L.: Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 25543–25551 (2025)
2025
-
[50]
arXiv preprint arXiv:2506.22283 (2025)
Xu, R., Wang, Y., Luo, Y., Du, B.: Rethinking visual token reduction in lvlms under cross-modal misalignment. arXiv preprint arXiv:2506.22283 (2025)
-
[51]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Yang, L., Zheng, Z., Chen, B., Zhao, Z., Lin, C., Shen, C.: Nullu: Mitigating object hallucinations in large vision-language models via halluspace projection. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14635–14645 (2025)
2025
-
[52]
arXiv preprint arXiv:2512.02981 (2025)
Yang, Z., Yuan, Y., Jiang, X., An, B., Pang, W.: Inex: Hallucination mitigation via introspection and cross-modal multi-agent collaboration. arXiv preprint arXiv:2512.02981 (2025)
-
[53]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Yin, H., Si, G., Wang, Z.: Clearsight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14625–14634 (2025)
2025
-
[54]
Science China Information Sciences 67(12), 220105 (2024)
Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., Chen, E.: Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences 67(12), 220105 (2024)
2024
-
[55]
Zhang, C., Wan, Z., Kan, Z., Ma, M.Q., Stepputtis, S., Ramanan, D., Salakhutdinov, R., Morency, L.P., Sycara, K., Xie, Y.: Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2502.06130 (2025)
-
[56]
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)
arXiv 2022
-
[57]
arXiv preprint arXiv:2504.17309 (2025)
Zhang, J., Liu, S., Liu, A., Gao, Y., Li, J., Gu, X., Hu, X.: Cohemark: A novel sentence-level watermark for enhanced text quality. arXiv preprint arXiv:2504.17309 (2025)
-
[58]
arXiv preprint arXiv:2601.20279 (2026)
Zhang, X., Zhu, Y., Gu, C., Yuan, X., Zhao, Q., Cao, J., Tang, F., Fan, S., Shen, Y., Shen, C., et al.: Hallucination begins where saliency drops. arXiv preprint arXiv:2601.20279 (2026)
-
[59]
Mm-rlhf: The next step forward in multimodal llm alignment,
Zhang, Y.F., Yu, T., Tian, H., Fu, C., Li, P., Zeng, J., Xie, W., Shi, Y., Zhang, H., Wu, J., et al.: Mm-rlhf: The next step forward in multimodal llm alignment. arXiv preprint arXiv:2502.10391 (2025)
-
[60]
arXiv preprint arXiv:2510.02342 (2025)
Zhang, Y., Liu, S., Yang, X., Hu, X.: Catmark: A context-aware thresholding framework for robust cross-task watermarking in large language models. arXiv preprint arXiv:2510.02342 (2025)
-
[61]
In: 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR)
Zhao, F., Zhang, C., Zhang, R., Wang, T., Li, X.: Mitigating image captioning hallucinations in vision-language models. In: 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 297–302. IEEE (2025)
2025
-
[62]
arXiv preprint arXiv:2505.14257 (2025)
Zhao, J., Zhang, F., Sun, X., Feng, C.: Aligning attention distribution to information flow for hallucination mitigation in large vision-language models. arXiv preprint arXiv:2505.14257 (2025)
-
[63]
arXiv preprint arXiv:2505.10634 (2025)
Zhao, J., Zhang, F., Sun, X., Kong, L., Tan, Z., Feng, C.: Cross-image contrastive decoding: Precise, lossless suppression of language priors in large vision-language models. arXiv preprint arXiv:2505.10634 (2025)
-
[64]
Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,
Zhao, Z., Wang, B., Ouyang, L., Dong, X., Wang, J., He, C.: Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839 (2023)
-
[65]
arXiv preprint arXiv:2601.07291 (2026)
Zheng, Q., Liu, S., Huang, Y., Jia, S., Li, J., Chen, L., Chen, J., Li, H., Liu, A., Yan, Y., et al.: A visual semantic adaptive watermark grounded by prefix-tuning for large vision-language model. arXiv preprint arXiv:2601.07291 (2026)
-
[66]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Zhu, L., Ji, D., Chen, T., Xu, P., Ye, J., Liu, J.: Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1624–1633 (2025)
2025