pith. machine review for the scientific record.

arxiv: 2604.01989 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual inertia · cognitive hallucination · multimodal large language models · attention dynamics · training-free method · compositional inference · hallucination mitigation · relational reasoning

The pith

Visual attention in multimodal models remains static after initial steps, leading to failures in relational reasoning that a new excitation method can address.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that visual attention in MLLMs settles early and stays fixed, preventing the dynamic shifts needed for understanding relationships between objects. This visual inertia contributes to cognitive hallucinations, which involve incorrect deductions rather than misperceptions of single objects. Existing methods do not target this because they focus on perceptual errors. By analyzing token-wise attention, the authors link persistent focus on critical regions to poor compositional inference. They propose IVE as a training-free fix that promotes attention to emerging tokens and penalizes inertial concentration to support better inference.

Core claim

Token-wise attention analysis shows that visual attention in MLLMs exhibits pronounced inertia: it settles during early decoding steps, remains largely static thereafter, and fails to dynamically support the relational inference required for cognitive understanding. The proposed Inertia-aware Visual Excitation (IVE) method breaks this pattern by selecting visual tokens that emerge dynamically relative to historical attention trends and by applying an inertia-aware penalty that discourages over-concentration and persistence in localized regions, thereby modeling cognitive inference as the dynamic responsiveness of visual attention.

What carries the argument

The Inertia-aware Visual Excitation (IVE) method, which identifies inertial attention behavior and excites dynamic token responsiveness through trend-based selection and penalties.
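The trend-based selection can be made concrete. Below is a minimal sketch, not the authors' implementation: the paper says only "relative to historical attention trends", so the exponential-moving-average trend, the function name, and the `beta` and `margin` parameters are all assumptions for illustration.

```python
import numpy as np

def select_emerging_tokens(attn_history, beta=0.9, margin=0.0):
    """Classify visual tokens as 'emerging' vs 'inertial' at the current
    decoding step, by comparing each token's current attention mass to an
    exponential moving average (EMA) of its past attention.

    attn_history: array of shape (steps, n_visual_tokens); each row holds
    the attention mass every visual token received at one decoding step.
    Returns boolean masks (emerging, inertial).
    """
    attn_history = np.asarray(attn_history, dtype=float)
    past, current = attn_history[:-1], attn_history[-1]
    # EMA over past steps: more recent steps weigh more.
    trend = past[0].copy()
    for row in past[1:]:
        trend = beta * trend + (1.0 - beta) * row
    # Emerging: attention rising relative to the token's own history.
    emerging = current > trend + margin
    # Inertial: not rising, yet still holding above-average focus.
    inertial = ~emerging & (current > current.mean())
    return emerging, inertial
```

With a history where one token's attention jumps at the last step while another stays persistently high, the first is flagged emerging and the second inertial, matching the paper's distinction between dynamic and inertial behavior.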

Load-bearing premise

That promoting dynamic visual token selection and penalizing attention persistence will enhance relational inference capabilities without causing new errors or reducing performance on non-cognitive tasks.

What would settle it

Running IVE on a cognitive hallucination benchmark and finding no reduction, relative to the base model, in error rates on questions requiring inter-object relational deduction would refute the central claim.

Figures

Figures reproduced from arXiv: 2604.01989 by Boyang Gong, Fanye Kong, Jie Zhou, Jiwen Lu, Yu Zheng.

Figure 1
Figure 1: Overview of the cognitive hallucinations and their mitigation via the proposed IVE. Upper left: comparison between perceptual and cognitive hallucinations in MLLMs. Upper right: comparison with hallucination mitigation methods across multiple benchmarks. Bottom: compared to PAI [32], which amplifies visual attention and is designed primarily for mitigating perceptual hallucinations, our IVE effectively reduces bot…
Figure 2
Figure 2: Limited effectiveness of visual attention amplification on cognitive hallucinations. We assess the improvements of attention amplification (PAI) [32] on perceptual (POPE) [25] and cognitive (Reefknot) [63] hallucination benchmarks. Recent studies [32, 51, 55] have identified imbalanced token attention as a primary cause of perceptual hallucinations, where models fail to adequately attend to critical vi…
Figure 3
Figure 3: The naive visual attention amplification method exacerbates visual…
Figure 4
Figure 4: Overview of our proposed IVE framework. Left: …
Figure 4
Figure 4: Specifically, the redistributed attention…
Figure 5
Figure 5: Results (%) on the MMBench [33] benchmark, which assesses the multidimensional performance of MLLMs. Cognitive hallucination: we evaluated the effectiveness of our method IVE in mitigating cognitive hallucinations on two benchmarks [12, 63]. As shown in…
Figure 6
Figure 6: Visualization of the visual activeness of each token, comparing hallucinated and non-hallucinated responses in cognitive hallucination tasks. Rows indicate layer indices in LLaVA-1.5 [30], and columns correspond to decoded tokens at each step. Both LLaVA-1.5 [30] and InstructBLIP [8] achieve strong performance on the Reefknot [63] and POPE [25] benchmarks, reducing cognitive and perceptual hallucinations. Notably…
Figure 7
Figure 7: Radar chart comparison of hallucination mitigation methods across multiple benchmarks under InstructBLIP [8] and Qwen2.5-VL-Instruct [2].
Figure 8
Figure 8: Attention heatmap comparison between our IVE and the naive attention amplification method PAI [32].
Figure 9
Figure 9: Visual activeness comparison between our IVE and the naive attention amplification method PAI [32] during image captioning. The results show that PAI [32] consistently exhibits lower pairwise distances, indicating reduced semantic activeness in visual attention. In contrast, IVE maintains higher pairwise distances, suggesting more active and dynamic attention transi…
Figure 10
Figures 10∼17 present several representative examples where IVE successfully suppresses both perceptual and cognitive hallucinations. Compared with the baseline responses, the outputs generated by IVE demonstrate improved alignment with the visual content, capturing relevant objects, attributes, and inter-object relationships more reliably. These examples highlight that IVE can effectively guide the model to atte…
Figure 11
Figure 11: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63]. Prompt: "Please describe the photo in detail." Sample response: "The image features a woman sitting at a red dining table with a teddy bear on her lap. She is smiling and appears to be enjoying her time at the table. The table is surrounded by chairs, and there is a red cloth covering it. In the background, there is a tent …"
Figure 12
Figure 12: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63].
Figure 13
Figure 13: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63]. Sample response: "The image shows four men gathered outdoors, working together to move a large white appliance. One of the men is holding the metal refrigerator door while another appears to be directing the group as they load the appliance into a nearby truck. Several bicycles are parked around them, and lush green plants with white and pink flow…"
Figure 14
Figure 14: An illustration of IVE mitigating cognitive hallucinations on Reefknot [63].
Figure 15
Figure 15: An illustration of IVE mitigating cognitive hallucinations on MME [12]. Prompt: "Is there no snowboard in the image?" LLaVA + Our IVE: "No. There are some snowboards in the image." LLaVA + PAI: "Yes. There is no snowboard in the image." LLaVA: "Yes. There is no snowboard in the image." Prompt: "Is there a knife in the ima…"
Figure 16
Figure 16: An illustration of IVE mitigating perceptual hallucinations on POPE [25]. Prompt: "Based on this image, please predict what will happen? A. This egg will be cooked. B. This egg will fall down. C. The egg will be broken." LLaVA: "B. This egg will fall down." LLaVA + PAI: "B. This egg will fall down." LLaVA + Our IVE: "C. The egg will be broken." Prompt: "How many trucks are in this …" LLaVA: "B. five." LLaVA + PAI: "C. seven." LLaVA + Our IVE: "A. six."
Figure 17
Figure 17: An illustration of IVE mitigating cognitive hallucinations on MMBench [33].
read the original abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
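The abstract describes the inertia-aware penalty only qualitatively. One plausible reading, sketched below under stated assumptions: subtract from each visual token's attention logit a term proportional to how long that token has stayed dominant, then renormalize. The function name, the `persistence` counter, and `alpha` are hypothetical, not taken from the paper.

```python
import numpy as np

def inertia_penalized_attention(logits, persistence, alpha=1.0):
    """Hypothetical inertia-aware penalty: down-weight visual tokens that
    have stayed dominant for many consecutive decoding steps, so attention
    mass can flow to emerging tokens. persistence[i] counts how long token
    i has been among the top-attended tokens."""
    z = np.asarray(logits, dtype=float) - alpha * np.asarray(persistence, dtype=float)
    z -= z.max()            # numerical stability before softmax
    p = np.exp(z)
    return p / p.sum()      # renormalized attention distribution
```

With `logits = [2, 1, 0]` and `persistence = [3, 0, 0]`, the previously dominant first token is demoted and the second token receives the most attention; with zero persistence everywhere the original ranking is unchanged.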

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that visual attention in MLLMs exhibits pronounced inertia—remaining largely static after early decoding steps and failing to support compositional relational inference—leading to cognitive hallucinations distinct from perceptual ones. Through token-wise attention analysis, the authors identify persistent focus on semantically critical regions as the key factor and propose a training-free Inertia-aware Visual Excitation (IVE) method. IVE selects dynamically emerging visual tokens relative to historical attention trends, distinguishes inertial tokens, and applies an inertia-aware penalty to discourage over-concentration, thereby modeling cognitive inference as dynamic attention responsiveness. Experiments reportedly show effectiveness across base MLLMs and hallucination benchmarks, especially for cognitive cases.

Significance. If the central claim holds, the work provides a novel, training-free perspective on hallucination mitigation by directly targeting attention dynamics rather than perceptual object attributes. The parameter-free grounding in token-wise analysis and the explicit modeling of inertia as a barrier to relational deduction are strengths that could inform future MLLM interpretability research. However, the absence of ablations isolating the historical-trend mechanism limits the ability to credit the specific inertia-breaking logic over generic attention spreading.

major comments (2)
  1. [§3] §3 (IVE method description): the central claim that selecting tokens relative to historical trends plus an inertia-aware penalty specifically enables compositional inference is not supported by any ablation that isolates this component from generic attention redistribution or diversity injection; improvements on benchmarks could arise from any increase in attention spread.
  2. [Experiments] Experiments section: no ablation studies, quantitative metrics, or error analysis are referenced that would verify the causal role of visual inertia in cognitive hallucinations versus other attention patterns, leaving the data support for the mapping from IVE operations to reduced hallucinations unverified.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'extensive experiments' is used without any specific benchmark names, metrics, or baseline comparisons, reducing clarity on the scope of validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that additional ablations are needed to strengthen the causal claims regarding visual inertia and the specific contributions of the historical-trend mechanism in IVE. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (IVE method description): the central claim that selecting tokens relative to historical trends plus an inertia-aware penalty specifically enables compositional inference is not supported by any ablation that isolates this component from generic attention redistribution or diversity injection; improvements on benchmarks could arise from any increase in attention spread.

    Authors: We acknowledge that the current version does not include an ablation isolating the historical attention trend component from generic spreading. In the revision we will add a controlled comparison of IVE against a variant that applies uniform diversity injection without historical trend selection, together with attention visualization showing the difference in dynamic responsiveness on relational queries. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation studies, quantitative metrics, or error analysis are referenced that would verify the causal role of visual inertia in cognitive hallucinations versus other attention patterns, leaving the data support for the mapping from IVE operations to reduced hallucinations unverified.

    Authors: We agree that stronger quantitative linkage is required. The revised experiments section will include (i) attention entropy and shift-rate metrics computed across decoding steps to quantify inertia, (ii) error analysis breaking down hallucination types before/after IVE, and (iii) an ablation that disables the inertia-aware penalty while keeping token selection, to isolate its contribution to cognitive hallucination reduction. revision: yes
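The metrics promised in (i) are straightforward to formalize. A minimal sketch of one possible definition, with hypothetical function names since the rebuttal fixes no formulas: Shannon entropy of the attention distribution over visual tokens (low entropy suggests the concentrated focus associated with inertia), and a "shift rate" measured as the mean total-variation distance between attention maps at consecutive decoding steps (near zero means static attention).

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of one attention distribution over visual tokens.
    Low entropy = concentrated focus; high entropy = spread-out attention."""
    p = np.asarray(attn, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # 0 * log(0) contributes nothing
    return float(-(p * np.log(p)).sum())

def shift_rate(attn_history):
    """Mean total-variation distance between attention maps at consecutive
    decoding steps. Near 0 = static attention (inertia); larger = dynamic."""
    h = np.asarray(attn_history, dtype=float)
    h = h / h.sum(axis=1, keepdims=True)          # normalize each step
    diffs = np.abs(h[1:] - h[:-1]).sum(axis=1) / 2.0  # TV distance per step
    return float(diffs.mean())
```

A uniform distribution over four tokens gives entropy ln 4; a perfectly static sequence gives shift rate 0, and a complete attention swap between two tokens gives shift rate 1.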

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper grounds its identification of visual inertia directly in token-wise attention analysis of MLLM decoding steps and proposes an explicitly training-free IVE method that selects dynamically emerging tokens relative to historical trends while applying an inertia-aware penalty. No equations or steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the mapping from observed attention patterns to the proposed operations remains an independent modeling choice supported by external hallucination benchmarks rather than internal equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the domain assumption that token-wise attention trends accurately reveal and can be used to correct the root cause of cognitive hallucinations. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption Token-wise attention analysis during decoding can identify persistent focus patterns that limit relational inference
    Invoked as the basis for identifying visual inertia as the key factor.
invented entities (1)
  • Inertia-aware Visual Excitation (IVE) no independent evidence
    purpose: To break visual attention inertia by selecting dynamically emerging tokens and applying a penalty on over-concentration
    New method introduced to model cognitive inference as dynamic attention responsiveness

pith-pipeline@v0.9.0 · 5507 in / 1335 out tokens · 53431 ms · 2026-05-13T22:03:04.941906+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 7 internal anchors

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
  3. [3] Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D., Maund, D., Shotton, J.: Driving with LLMs: Fusing object-level vector modality for explainable autonomous driving. In: ICRA. pp. 14093–14100 (2024)
  4. [4] Chen, Z., Zhao, Z., Luo, H., Yao, H., Li, B., Zhou, J.: HALC: Object hallucination reduction via adaptive focal-contrast decoding. In: ICML. pp. 7824–7846 (2024)
  5. [5] Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR. pp. 2818–2829 (2023)
  6. [6] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023)
  7. [7] Chuang, Y.S., Xie, Y., Luo, H., Kim, Y., Glass, J., He, P.: DoLa: Decoding by contrasting layers improves factuality in large language models. In: ICLR. pp. 54158–54183 (2024)
  8. [8] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS. vol. 36, pp. 49250–49267 (2023)
  9. [9] Dai, W., Liu, Z., Ji, Z., Su, D., Fung, P.: Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In: EACL. pp. 2136–2148 (2023)
  10. [10] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL. pp. 4171–4186 (2019)
  11. [11] Ding, Y., Zhu, X., Xia, T., Wu, J., Chen, X., Liu, Q., Wang, L.: D2HScore: Reasoning-aware hallucination detection via semantic breadth and depth analysis in LLMs. arXiv preprint arXiv:2509.11569 (2025)
  12. [12] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. In: NeurIPS (2025)
  13. [13] GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al.: ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024)
  14. [14] Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: CVPR. pp. 14375–14385 (2024)
  15. [15] Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. In: AAAI. vol. 38, pp. 18135–18143 (2024)
  16. [16] Hu, M., Pan, S., Li, Y., Yang, X.: Advancing medical imaging with language models: A journey from n-grams to ChatGPT. arXiv preprint arXiv:2304.04920 (2023)
  17. [17] Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., Yu, N.: OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In: CVPR. pp. 13418–13427 (2024)
  18. [18] Jiang, C., Xu, H., Dong, M., Chen, J., Ye, W., Yan, M., Ye, Q., Zhang, J., Huang, F., Zhang, S.: Hallucination augmented contrastive learning for multimodal large language model. In: CVPR. pp. 27036–27046 (2024)
  19. [19] Jung, M., Lee, S., Kim, E., Yoon, S.: Visual attention never fades: Selective progressive attention recalibration for detailed image captioning in multimodal large language models. In: ICML (2025)
  20. [20] Kim, M., Kim, M., Bae, J., Choi, S., Kim, S., Chang, B.: Exploiting semantic reconstruction to mitigate hallucinations in vision-language models. In: ECCV. pp. 236–252 (2024)
  21. [21] Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., Bing, L.: Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In: CVPR. pp. 13872–13882 (2024)
  22. [22] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML. pp. 19730–19742 (2023)
  23. [23] Li, K., Patel, O., Viégas, F., Pfister, H., Wattenberg, M.: Inference-time intervention: Eliciting truthful answers from a language model. In: NeurIPS. vol. 36, pp. 41451–41530 (2023)
  24. [24] Li, X.L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., Lewis, M.: Contrastive decoding: Open-ended text generation as optimization. In: ACL. pp. 12286–12312 (2023)
  25. [25] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP (2023)
  26. [26] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755 (2014)
  27. [27] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)
  28. [28] Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Aligning large multi-modal model with robust instruction tuning. In: ICLR. pp. 57689–57733 (2024)
  29. [29] Liu, H., Zhu, Y., Kato, K., Kondo, I., Aoyama, T., Hasegawa, Y.: LLM-based human-robot collaboration framework for manipulation tasks. arXiv preprint arXiv:2308.14972 (2023)
  30. [30] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. pp. 26296–26306 (2024)
  31. [31] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS. vol. 36, pp. 34892–34916 (2024)
  32. [32] Liu, S., Zheng, K., Chen, W.: Paying more attention to image: A training-free method for alleviating hallucination in LVLMs. In: ECCV. pp. 125–140 (2024)
  33. [33] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233 (2024)
  34. [34] Mai, J., Chen, J., Li, B., Qian, G., Elhoseiny, M., Ghanem, B.: LLM as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349 (2023)
  35. [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763 (2021)
  36. [36] Ren, J., Zhao, Y., Vu, T., Liu, P.J., Lakshminarayanan, B.: Self-evaluation improves selective generation in large language models. In: NeurIPS Workshop (2023)
  37. [37] Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: EMNLP (2018)
  38. [38] Su, W., Wang, C., Ai, Q., Hu, Y., Wu, Z., Zhou, Y., Liu, Y.: Unsupervised real-time hallucination detection based on the internal states of large language models. In: ACL. pp. 14379–14391 (2025)
  39. [39] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.X., Yang, Y., et al.: Aligning large multimodal models with factually augmented RLHF. In: ACL. pp. 13088–13110 (2024)
  40. [40] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford Alpaca: An instruction-following LLaMA model (2023)
  41. [41] Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., Manning, C.D.: Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In: ACL. pp. 5433–5442 (2023)
  42. [42] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  43. [43] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  44. [44] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS. vol. 30 (2017)
  45. [45] Wang, C., Chen, X., Zhang, N., Tian, B., Xu, H., Deng, S., Chen, H.: MLLM can see? Dynamic correction decoding for hallucination mitigation. In: ICLR. pp. 13712–13736 (2025)
  46. [46] Wang, S., Zhao, Z., Ouyang, X., Liu, T., Wang, Q., Shen, D.: Interactive computer-aided diagnosis on medical image using large language models. Communications Engineering 3(1), 133 (2024)
  47. [47] Wang, X., Pan, J., Ding, L., Biemann, C.: Mitigating hallucinations in large vision-language models with instruction contrastive decoding. In: ACL. pp. 15840–15853 (2024)
  48. [48] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
  49. [49] Wu, Z., Wang, Z., Xu, X., Lu, J., Yan, H.: Embodied task planning with large language models. arXiv preprint arXiv:2307.01848 (2023)
  50. [50] Xie, Y., Zhu, Z., Zhuang, X., Liang, L., Wang, Z., Zou, Y.: GPA: Global and prototype alignment for audio-text retrieval. In: Proc. Interspeech. pp. 5078–5082 (2024)
  51. [51] Yin, H., Si, G., Wang, Z.: ClearSight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In: CVPR. pp. 14625–14634 (2025)
  52. [52] Yin, Y., Xie, Y., Yang, W., Yang, D., Ru, J., Zhuang, X., Liang, L., Zou, Y.: ATRI: Mitigating multilingual audio text retrieval inconsistencies by reducing data distribution errors. In: ACL. pp. 5491–5504 (2025)
  53. [53] Yu, Q., Li, J., Wei, L., Pang, L., Ye, W., Qin, B., Tang, S., Tian, Q., Zhuang, Y.: HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data. In: CVPR. pp. 12944–12953 (2024)
  54. [54] Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu, Z., Zheng, H.T., Sun, M., et al.: RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In: CVPR. pp. 13807–13816 (2024)
  55. [55] Yu, X., Xu, C., Zhang, G., He, Y., Chen, Z., Xue, Z., Zhang, J., Liao, Y., Hu, X., Jiang, Y.G., et al.: Visual multi-agent system: Mitigating hallucination snowballing via visual flow. arXiv preprint arXiv:2509.21789 (2025)
  56. [56] Yuan, F., Qin, C., Xu, X., Li, P.: HELPD: Mitigating hallucination of LVLMs by hierarchical feedback learning with vision-enhanced penalty decoding. In: EMNLP. pp. 1768–1785 (2024)
  57. [57] Yue, Z., Zhang, L., Jin, Q.: Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In: ACL. pp. 11766–11781 (2024)
  58. [58] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: CVPR. pp. 11975–11986 (2023)
  59. [59] Zhang, Y., Cui, L., Shi, S., et al.: Alleviating hallucinations of large language models through induced hallucinations. In: NAACL. pp. 8218–8232 (2025)
  60. [60] Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al.: Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics. pp. 1–46 (2025)
  61. [61] Zhao, Y., Li, K., Cheng, Z., Qiao, P., Zheng, X., Ji, R., Liu, C., Yuan, L., Chen, J.: GraCo: Granularity-controllable interactive segmentation. In: CVPR. pp. 3501–3510 (2024)
  62. [62] Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: DETRs beat YOLOs on real-time object detection. In: CVPR. pp. 16965–16974 (2024)
  63. [63] Zheng, K., Chen, J., Yan, Y., Zou, X., Hu, X.: Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models. In: ACL. pp. 6193–6212 (2024)
  64. [64] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: NeurIPS. vol. 36, pp. 46595–46623 (2023)
  65. [65] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In: ICLR. pp. 18378–18394 (2024)
  66. [66] Zhuang, X., Cheng, X., Zhu, Z., Chen, Z., Li, H., Zou, Y.: Towards multimodal-augmented pre-trained language models via self-balanced expectation-maximization iteration. In: ACM MM (2024)
  67. [67] Zhuang, X., Zhu, Z., Xie, Y., Liang, L., Zou, Y.: VASparse: Towards efficient visual hallucination mitigation via visual-aware token sparsification. In: CVPR. pp. 4189–4199 (2025)
  68. [68] Zou, X., Wang, Y., Yan, Y., Lyu, Y., Zheng, K., Huang, S., Chen, J., Jiang, P., Liu, J., Tang, C., et al.: Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. In: ICML (2025)