pith. machine review for the scientific record.

arxiv: 2604.01989 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual inertia · cognitive hallucination · multimodal large language models · attention dynamics · training-free method · compositional inference · hallucination mitigation · relational reasoning

The pith

Visual attention in multimodal models remains static after initial steps, leading to failures in relational reasoning that a new excitation method can address.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that visual attention in MLLMs settles early and stays fixed, preventing the dynamic shifts needed for understanding relationships between objects. This visual inertia contributes to cognitive hallucinations, which involve incorrect deductions rather than misperceptions of single objects. Existing methods do not target this because they focus on perceptual errors. By analyzing token-wise attention, the authors link persistent focus on critical regions to poor compositional inference. They propose IVE as a training-free fix that promotes attention to emerging tokens and penalizes inertial concentration to support better inference.

Core claim

Token-wise attention analysis shows that visual attention in MLLMs exhibits pronounced inertia: it settles during early decoding steps, remains largely static thereafter, and fails to dynamically support the relational inference required for cognitive understanding. The proposed Inertia-aware Visual Excitation (IVE) method breaks this pattern by selecting visual tokens that emerge dynamically relative to historical attention trends and by applying an inertia-aware penalty that discourages over-concentration and persistence in localized regions, thereby modeling cognitive inference as the dynamic responsiveness of visual attention.

What carries the argument

The Inertia-aware Visual Excitation (IVE) method, which identifies inertial attention behavior and excites dynamic token responsiveness through trend-based selection and penalties.
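The trend-based selection can be made concrete. Below is a minimal sketch, not the authors' implementation: the paper says only "relative to historical attention trends", so the exponential-moving-average trend, the function name, and the `beta` and `margin` parameters are all assumptions for illustration.

```python
import numpy as np

def select_emerging_tokens(attn_history, beta=0.9, margin=0.0):
    """Classify visual tokens as 'emerging' vs 'inertial' at the current
    decoding step, by comparing each token's current attention mass to an
    exponential moving average (EMA) of its past attention.

    attn_history: array of shape (steps, n_visual_tokens); each row holds
    the attention mass every visual token received at one decoding step.
    Returns boolean masks (emerging, inertial).
    """
    attn_history = np.asarray(attn_history, dtype=float)
    past, current = attn_history[:-1], attn_history[-1]
    # EMA over past steps: more recent steps weigh more.
    trend = past[0].copy()
    for row in past[1:]:
        trend = beta * trend + (1.0 - beta) * row
    # Emerging: attention rising relative to the token's own history.
    emerging = current > trend + margin
    # Inertial: not rising, yet still holding above-average focus.
    inertial = ~emerging & (current > current.mean())
    return emerging, inertial
```

With a history where one token's attention jumps at the last step while another stays persistently high, the first is flagged emerging and the second inertial, matching the paper's distinction between dynamic and inertial behavior.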

Load-bearing premise

That promoting dynamic visual token selection and penalizing attention persistence will enhance relational inference capabilities without causing new errors or reducing performance on non-cognitive tasks.

What would settle it

Running IVE on a cognitive hallucination benchmark and finding no reduction, relative to the base model, in error rates on questions requiring inter-object relational deduction would refute the central claim.

Figures

Figures reproduced from arXiv: 2604.01989 by Boyang Gong, Fanye Kong, Jie Zhou, Jiwen Lu, Yu Zheng.

Figure 1
Figure 1: Overview of the cognitive hallucinations and their mitigation via the proposed IVE. Upper left: comparison between perceptual and cognitive hallucinations in MLLMs. Upper right: comparison with hallucination mitigation methods across multiple benchmarks. Bottom: compared to PAI [32], which amplifies visual attention and is designed primarily for mitigating perceptual hallucinations, our IVE effectively reduces bot…
Figure 2
Figure 2: Limited effectiveness of visual attention amplification on cognitive hallucinations. We assess the improvements of attention amplification (PAI) [32] on perceptual (POPE) [25] and cognitive (Reefknot) [63] hallucination benchmarks. Recent studies [32, 51, 55] have identified imbalanced token attention as a primary cause of perceptual hallucinations, where models fail to adequately attend to critical vi…
Figure 3
Figure 3: The naive visual attention amplification method exacerbates visual…
Figure 4
Figure 4: Overview of our proposed IVE framework. Left: …
Figure 4
Figure 4: Specifically, the redistributed attention…
Figure 5
Figure 5: Results (%) on the MMBench [33] benchmark, which assesses the multidimensional performance of MLLMs. Cognitive hallucination: we evaluated the effectiveness of our method IVE in mitigating cognitive hallucinations on two benchmarks [12, 63]. As shown in…
Figure 6
Figure 6: Visualization of the visual activeness of each token, comparing hallucinated and non-hallucinated responses in cognitive hallucination tasks. Rows indicate layer indices in LLaVA-1.5 [30], and columns correspond to decoded tokens at each step. Both LLaVA-1.5 [30] and InstructBLIP [8] achieve strong performance on the Reefknot [63] and POPE [25] benchmarks, reducing cognitive and perceptual hallucinations. Notably…
Figure 7
Figure 7: Radar chart comparison of hallucination mitigation methods across multiple benchmarks under InstructBLIP [8] and Qwen2.5-VL-Instruct [2].
Figure 8
Figure 8: Attention heatmap comparison between our IVE and the naive attention amplification method PAI [32].
Figure 9
Figure 9: Visual activeness comparison between our IVE and the naive attention amplification method PAI [32] during image captioning. The results show that PAI [32] consistently exhibits lower pairwise distances, indicating reduced semantic activeness in visual attention. In contrast, IVE maintains higher pairwise distances, suggesting more active and dynamic attention transi…
Figure 10
Figures 10∼17 present several representative examples where IVE successfully suppresses both perceptual and cognitive hallucinations. Compared with the baseline responses, the outputs generated by IVE demonstrate improved alignment with the visual content, capturing relevant objects, attributes, and inter-object relationships more reliably. These examples highlight that IVE can effectively guide the model to atte…
Figure 11
Figure 11: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63]. Prompt: "Please describe the photo in detail." Sample response: "The image features a woman sitting at a red dining table with a teddy bear on her lap. She is smiling and appears to be enjoying her time at the table. The table is surrounded by chairs, and there is a red cloth covering it. In the background, there is a tent …"
Figure 12
Figure 12: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63].
Figure 13
Figure 13: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63]. Sample response: "The image shows four men gathered outdoors, working together to move a large white appliance. One of the men is holding the metal refrigerator door while another appears to be directing the group as they load the appliance into a nearby truck. Several bicycles are parked around them, and lush green plants with white and pink flow…"
Figure 14
Figure 14: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63].
Figure 15
Figure 15: An illustration of IVE mitigating cognitive hallucinations on MME [12]. Prompt: "Is there no snowboard in the image?" LLaVA + Our IVE: "No. There are some snowboards in the image." LLaVA + PAI: "Yes. There is no snowboard in the image." LLaVA: "Yes. There is no snowboard in the image." Prompt: "Is there a knife in the ima…"
Figure 16
Figure 16: An illustration of IVE mitigating perceptual hallucinations on POPE [25]. Prompt: "Based on this image, please predict what will happen? A. This egg will be cooked. B. This egg will fall down. C. The egg will be broken." LLaVA: "B. This egg will fall down." LLaVA + PAI: "B. This egg will fall down." LLaVA + Our IVE: "C. The egg will be broken." Prompt: "How many trucks are in this …" LLaVA: "B. five." LLaVA + PAI: "C. seven." LLaVA + Our IVE: "A. six."
Figure 17
Figure 17: An illustration of IVE mitigating cognitive hallucinations on MMBench [33].
read the original abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
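The abstract describes the inertia-aware penalty only qualitatively. One plausible reading, sketched below under stated assumptions: subtract from each visual token's attention logit a term proportional to how long that token has stayed dominant, then renormalize. The function name, the `persistence` counter, and `alpha` are hypothetical, not taken from the paper.

```python
import numpy as np

def inertia_penalized_attention(logits, persistence, alpha=1.0):
    """Hypothetical inertia-aware penalty: down-weight visual tokens that
    have stayed dominant for many consecutive decoding steps, so attention
    mass can flow to emerging tokens. persistence[i] counts how long token
    i has been among the top-attended tokens."""
    z = np.asarray(logits, dtype=float) - alpha * np.asarray(persistence, dtype=float)
    z -= z.max()            # numerical stability before softmax
    p = np.exp(z)
    return p / p.sum()      # renormalized attention distribution
```

With `logits = [2, 1, 0]` and `persistence = [3, 0, 0]`, the previously dominant first token is demoted and the second token receives the most attention; with zero persistence everywhere the original ranking is unchanged.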

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that visual attention in MLLMs exhibits pronounced inertia—remaining largely static after early decoding steps and failing to support compositional relational inference—leading to cognitive hallucinations distinct from perceptual ones. Through token-wise attention analysis, the authors identify persistent focus on semantically critical regions as the key factor and propose a training-free Inertia-aware Visual Excitation (IVE) method. IVE selects dynamically emerging visual tokens relative to historical attention trends, distinguishes inertial tokens, and applies an inertia-aware penalty to discourage over-concentration, thereby modeling cognitive inference as dynamic attention responsiveness. Experiments reportedly show effectiveness across base MLLMs and hallucination benchmarks, especially for cognitive cases.

Significance. If the central claim holds, the work provides a novel, training-free perspective on hallucination mitigation by directly targeting attention dynamics rather than perceptual object attributes. The parameter-free grounding in token-wise analysis and the explicit modeling of inertia as a barrier to relational deduction are strengths that could inform future MLLM interpretability research. However, the absence of ablations isolating the historical-trend mechanism limits the ability to credit the specific inertia-breaking logic over generic attention spreading.

major comments (2)
  1. [§3] §3 (IVE method description): the central claim that selecting tokens relative to historical trends plus an inertia-aware penalty specifically enables compositional inference is not supported by any ablation that isolates this component from generic attention redistribution or diversity injection; improvements on benchmarks could arise from any increase in attention spread.
  2. [Experiments] Experiments section: no ablation studies, quantitative metrics, or error analysis are referenced that would verify the causal role of visual inertia in cognitive hallucinations versus other attention patterns, leaving the data support for the mapping from IVE operations to reduced hallucinations unverified.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'extensive experiments' is used without any specific benchmark names, metrics, or baseline comparisons, reducing clarity on the scope of validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that additional ablations are needed to strengthen the causal claims regarding visual inertia and the specific contributions of the historical-trend mechanism in IVE. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (IVE method description): the central claim that selecting tokens relative to historical trends plus an inertia-aware penalty specifically enables compositional inference is not supported by any ablation that isolates this component from generic attention redistribution or diversity injection; improvements on benchmarks could arise from any increase in attention spread.

    Authors: We acknowledge that the current version does not include an ablation isolating the historical attention trend component from generic spreading. In the revision we will add a controlled comparison of IVE against a variant that applies uniform diversity injection without historical trend selection, together with attention visualization showing the difference in dynamic responsiveness on relational queries. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation studies, quantitative metrics, or error analysis are referenced that would verify the causal role of visual inertia in cognitive hallucinations versus other attention patterns, leaving the data support for the mapping from IVE operations to reduced hallucinations unverified.

    Authors: We agree that stronger quantitative linkage is required. The revised experiments section will include (i) attention entropy and shift-rate metrics computed across decoding steps to quantify inertia, (ii) error analysis breaking down hallucination types before/after IVE, and (iii) an ablation that disables the inertia-aware penalty while keeping token selection, to isolate its contribution to cognitive hallucination reduction. revision: yes
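The metrics promised in (i) are straightforward to formalize. A minimal sketch of one possible definition, with hypothetical function names since the rebuttal fixes no formulas: Shannon entropy of the attention distribution over visual tokens (low entropy suggests the concentrated focus associated with inertia), and a "shift rate" measured as the mean total-variation distance between attention maps at consecutive decoding steps (near zero means static attention).

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of one attention distribution over visual tokens.
    Low entropy = concentrated focus; high entropy = spread-out attention."""
    p = np.asarray(attn, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # 0 * log(0) contributes nothing
    return float(-(p * np.log(p)).sum())

def shift_rate(attn_history):
    """Mean total-variation distance between attention maps at consecutive
    decoding steps. Near 0 = static attention (inertia); larger = dynamic."""
    h = np.asarray(attn_history, dtype=float)
    h = h / h.sum(axis=1, keepdims=True)          # normalize each step
    diffs = np.abs(h[1:] - h[:-1]).sum(axis=1) / 2.0  # TV distance per step
    return float(diffs.mean())
```

A uniform distribution over four tokens gives entropy ln 4; a perfectly static sequence gives shift rate 0, and a complete attention swap between two tokens gives shift rate 1.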

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper grounds its identification of visual inertia directly in token-wise attention analysis of MLLM decoding steps and proposes an explicitly training-free IVE method that selects dynamically emerging tokens relative to historical trends while applying an inertia-aware penalty. No equations or steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the mapping from observed attention patterns to the proposed operations remains an independent modeling choice supported by external hallucination benchmarks rather than internal equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the domain assumption that token-wise attention trends accurately reveal and can be used to correct the root cause of cognitive hallucinations. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption Token-wise attention analysis during decoding can identify persistent focus patterns that limit relational inference
    Invoked as the basis for identifying visual inertia as the key factor.
invented entities (1)
  • Inertia-aware Visual Excitation (IVE) no independent evidence
    purpose: To break visual attention inertia by selecting dynamically emerging tokens and applying a penalty on over-concentration
    New method introduced to model cognitive inference as dynamic attention responsiveness

pith-pipeline@v0.9.0 · 5507 in / 1335 out tokens · 53431 ms · 2026-05-13T22:03:04.941906+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 7 internal anchors

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
  3. [3] Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D., Maund, D., Shotton, J.: Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving. In: ICRA. pp. 14093–14100 (2024)
  4. [4] Chen, Z., Zhao, Z., Luo, H., Yao, H., Li, B., Zhou, J.: HALC: Object hallucination reduction via adaptive focal-contrast decoding. In: ICML. pp. 7824–7846 (2024)
  5. [5] Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR. pp. 2818–2829 (2023)
  6. [6] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023)
  7. [7] Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J., He, P.: DoLa: Decoding by contrasting layers improves factuality in large language models. In: ICLR. pp. 54158–54183 (2024)
  8. [8] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS. vol. 36, pp. 49250–49267 (2023)
  9. [9] Dai, W., Liu, Z., Ji, Z., Su, D., Fung, P.: Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In: EACL. pp. 2136–2148 (2023)
  10. [10] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL. pp. 4171–4186 (2019)
  11. [11] Ding, Y., Zhu, X., Xia, T., Wu, J., Chen, X., Liu, Q., Wang, L.: D2HScore: Reasoning-aware hallucination detection via semantic breadth and depth analysis in LLMs. arXiv preprint arXiv:2509.11569 (2025)
  12. [12] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. In: NeurIPS (2025)
  13. [13] GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024)
  14. [14] Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: CVPR. pp. 14375–14385 (2024)
  15. [15] Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. In: AAAI. vol. 38, pp. 18135–18143 (2024)
  16. [16] Hu, M., Pan, S., Li, Y., Yang, X.: Advancing medical imaging with language models: A journey from n-grams to ChatGPT. arXiv preprint arXiv:2304.04920 (2023)
  17. [17] Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: CVPR. pp. 13418–13427 (2024)
  18. [18] Jiang, C., Xu, H., Dong, M., Chen, J., Ye, W., Yan, M., Ye, Q., Zhang, J., Huang, F., Zhang, S.: Hallucination augmented contrastive learning for multimodal large language model. In: CVPR. pp. 27036–27046 (2024)
  19. [19] Jung, M., Lee, S., Kim, E., Yoon, S.: Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models. In: ICML (2025)
  20. [20] Kim, M., Kim, M., Bae, J., Choi, S., Kim, S., Chang, B.: Exploiting semantic reconstruction to mitigate hallucinations in vision-language models. In: ECCV. pp. 236–252 (2024)
  21. [21] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: CVPR. pp. 13872–13882 (2024)
  22. [22] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML. pp. 19730–19742 (2023)
  23. [23] Li, K., Patel, O., Viégas, F., Pfister, H., Wattenberg, M.: Inference-time intervention: Eliciting truthful answers from a language model. In: NeurIPS. vol. 36, pp. 41451–41530 (2023)
  24. [24] Li, X.L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., Lewis, M.: Contrastive decoding: Open-ended text generation as optimization. In: ACL. pp. 12286–12312 (2023)
  25. [25] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP (2023)
  26. [26] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755 (2014)
  27. [27] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)
  28. [28] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Aligning large multi-modal model with robust instruction tuning. In: ICLR. pp. 57689–57733 (2024)
  29. [29] Liu, H., Zhu, Y., Kato, K., Kondo, I., Aoyama, T., Hasegawa, Y.: LLM-based human-robot collaboration framework for manipulation tasks. arXiv preprint arXiv:2308.14972 (2023)
  30. [30] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. pp. 26296–26306 (2024)
  31. [31] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS. vol. 36, pp. 34892–34916 (2024)
  32. [32] Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in LVLMs. In: ECCV. pp. 125–140 (2024)
  33. [33] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233 (2024)
  34. [34] Mai, J., Chen, J., Li, B., Qian, G., Elhoseiny, M., Ghanem, B.: LLM as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349 (2023)
  35. [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763 (2021)
  36. [36] Ren, J., Zhao, Y., Vu, T., Liu, P.J., Lakshminarayanan, B.: Self-evaluation improves selective generation in large language models. In: NeurIPS Workshop (2023)
  37. [37] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: EMNLP (2018)
  38. [38] Su, W., Wang, C., Ai, Q., Hu, Y., Wu, Z., Zhou, Y., Liu, Y.: Unsupervised real-time hallucination detection based on the internal states of large language models. In: ACL. pp. 14379–14391 (2025)
  39. [39] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented RLHF. In: ACL. pp. 13088–13110 (2024)
  40. [40] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford Alpaca: An instruction-following LLaMA model (2023)
  41. [41] Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., Manning, C.D.: Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In: ACL. pp. 5433–5442 (2023)
  42. [42] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  43. [43] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  44. [44] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS. vol. 30 (2017)
  45. [45] Wang, C., Chen, X., Zhang, N., Tian, B., Xu, H., Deng, S., Chen, H.: MLLM can see? Dynamic correction decoding for hallucination mitigation. In: ICLR. pp. 13712–13736 (2025)
  46. [46] Wang, S., Zhao, Z., Ouyang, X., Liu, T., Wang, Q., Shen, D.: Interactive computer-aided diagnosis on medical image using large language models. Communications Engineering 3(1), 133 (2024)
  47. [47] Wang, X., Pan, J., Ding, L., Biemann, C.: Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In: ACL. pp. 15840–15853 (2024)
  48. [48] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
  49. [49] Wu, Z., Wang, Z., Xu, X., Lu, J., Yan, H.: Embodied task planning with large language models. arXiv preprint arXiv:2307.01848 (2023)
  50. [50] Xie, Y., Zhu, Z., Zhuang, X., Liang, L., Wang, Z., Zou, Y.: GPA: Global and prototype alignment for audio-text retrieval. In: Proc. Interspeech. pp. 5078–5082 (2024)
  51. [51] Yin, H., Si, G., Wang, Z.: ClearSight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In: CVPR. pp. 14625–14634 (2025)
  52. [52] Yin, Y., Xie, Y., Yang, W., Yang, D., Ru, J., Zhuang, X., Liang, L., Zou, Y.: ATRI: Mitigating multilingual audio text retrieval inconsistencies by reducing data distribution errors. In: ACL. pp. 5491–5504 (2025)
  53. [53] Yu, Q., Li, J., Wei, L., Pang, L., Ye, W., Qin, B., Tang, S., Tian, Q., Zhuang, Y.: HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data. In: CVPR. pp. 12944–12953 (2024)
  54. [54] Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.T., Sun, M., et al.: RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In: CVPR. pp. 13807–13816 (2024)
  55. [55] Yu, X., Xu, C., Zhang, G., He, Y., Chen, Z., Xue, Z., Zhang, J., Liao, Y., Hu, X., Jiang, Y.G., et al.: Visual multi-agent system: Mitigating hallucination snowballing via visual flow. arXiv preprint arXiv:2509.21789 (2025)
  56. [56] Yuan, F., Qin, C., Xu, X., Li, P.: HELPD: Mitigating hallucination of LVLMs by hierarchical feedback learning with vision-enhanced penalty decoding. In: EMNLP. pp. 1768–1785 (2024)
  57. [57] Yue, Z., Zhang, L., Jin, Q.: Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In: ACL. pp. 11766–11781 (2024)
  58. [58] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: CVPR. pp. 11975–11986 (2023)
  59. [59] Zhang, Y., Cui, L., Shi, S., et al.: Alleviating hallucinations of large language models through induced hallucinations. In: NAACL. pp. 8218–8232 (2025)
  60. [60] Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al.: Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics. pp. 1–46 (2025)
  61. [61] Zhao, Y., Li, K., Cheng, Z., Qiao, P., Zheng, X., Ji, R., Liu, C., Yuan, L., Chen, J.: GraCo: Granularity-controllable interactive segmentation. In: CVPR. pp. 3501–3510 (2024)
  62. [62] Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: DETRs beat YOLOs on real-time object detection. In: CVPR. pp. 16965–16974 (2024)
  63. [63] Zheng, K., Chen, J., Yan, Y., Zou, X., Hu, X.: Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models. In: ACL. pp. 6193–6212 (2024)
  64. [64] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: NeurIPS. vol. 36, pp. 46595–46623 (2023)
  65. [65] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In: ICLR. pp. 18378–18394 (2024)
  66. [66] Zhuang, X., Cheng, X., Zhu, Z., Chen, Z., Li, H., Zou, Y.: Towards multimodal-augmented pre-trained language models via self-balanced expectation-maximization iteration. In: ACM MM (2024)
  67. [67] Zhuang, X., Zhu, Z., Xie, Y., Liang, L., Zou, Y.: VASparse: Towards efficient visual hallucination mitigation via visual-aware token sparsification. In: CVPR. pp. 4189–4199 (2025)
  68. [68] Zou, X., Wang, Y., Yan, Y., Lyu, Y., Zheng, K., Huang, S., Chen, J., Jiang, P., Liu, J., Tang, C., et al.: Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. In: ICML (2025)