Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Chenping Hou; Quanjiang Li; Tingjin Luo; Wei Luo; Zhiming Liu

arxiv: 2605.24602 · v2 · pith:G6BWZDVUnew · submitted 2026-05-23 · 💻 cs.CV · cs.AI

Quanjiang Li, Zhiming Liu, Wei Luo, Tingjin Luo, Chenping Hou

Pith reviewed 2026-06-30 13:04 UTC · model grok-4.3

keywords object hallucinations · multimodal large language models · attention distraction · visual perception · AFIP · inference-time correction · multi-head attention

The pith

Object hallucinations in multimodal models arise from attention distraction shown as spatial inconsistency across heads and temporal fading on image tokens, which AFIP corrects at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models produce object hallucinations when attention becomes distracted during decoding. This distraction appears as spatial inconsistency in multi-head attention and as fading focus on image tokens over successive steps, mirroring how divided attention blurs human vision. The paper shows theoretically that such dispersion raises model complexity and weakens generalization on classification tasks. To address it, the authors introduce AFIP, an inference-only method that enriches cross-head attention and dynamically reinforces historical attention to image tokens. Experiments across several benchmarks and models indicate that these corrections reduce hallucinations without any retraining.

Core claim

Hallucinations are strongly associated with attention distraction manifested as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. Attention dispersion increases model complexity and degrades classification generalization. AFIP corrects this distraction via cross-head attention enrichment and dynamic historical attention enhancement, improving visual perception at inference time without additional training.
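The two attention statistics named in the core claim can be made concrete with a small sketch. It is not the paper's measurement code: the function names are hypothetical, cosine similarity between heads is only one possible proxy for spatial consistency, and reading "VAR" as the share of attention mass placed on image tokens is an assumption, since the page never expands the acronym.

```python
import numpy as np

def attention_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of each distribution along the last axis."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + eps)).sum(axis=-1)

def cross_head_consistency(image_attn):
    """Mean pairwise cosine similarity between heads' image-attention maps.

    image_attn: (num_heads, num_image_tokens), non-negative, no all-zero rows.
    Values near 1 mean the heads look at the same region; values near 0
    mean the heads are spatially inconsistent.
    """
    a = np.asarray(image_attn, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    sim = a @ a.T
    off_diagonal = sim[~np.eye(sim.shape[0], dtype=bool)]
    return float(off_diagonal.mean())

def visual_attention_ratio(attn_row, image_mask):
    """Assumed reading of VAR: share of one step's attention on image tokens."""
    attn_row = np.asarray(attn_row, dtype=float)
    return float(attn_row[image_mask].sum() / attn_row.sum())

# Toy inputs: three heads agreeing on one patch vs. three heads on different patches.
focused = np.tile([0.9, 0.05, 0.05, 0.0], (3, 1))
scattered = np.eye(3, 4)
print(cross_head_consistency(focused), cross_head_consistency(scattered))
print(attention_entropy(focused[0]), attention_entropy(np.full(4, 0.25)))
```

Tracking `visual_attention_ratio` per decoding step would give a curve of the kind the temporal-fading claim describes, under the stated assumption about what VAR measures.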

What carries the argument

Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction through cross-head attention enrichment and dynamic historical attention enhancement.
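To show what an inference-time correction of this shape might look like, here is a minimal sketch. It is an illustration under strong assumptions, not AFIP itself: the page does not give the method's equations, so the blending rule, the function names, and the `blend` and `boost` parameters are all hypothetical. The only detail taken from the page is the history window of size 8 mentioned in the text that accompanies Figure 5.

```python
from collections import deque

import numpy as np

def enrich_across_heads(image_attn, blend=0.5):
    """Pull each head's image attention toward the head-averaged map.

    image_attn: (num_heads, num_image_tokens), each row sums to 1.
    blend: 0 leaves heads untouched, 1 makes all heads identical (hypothetical knob).
    A convex combination of distributions is again a distribution.
    """
    a = np.asarray(image_attn, dtype=float)
    consensus = a.mean(axis=0, keepdims=True)
    return (1.0 - blend) * a + blend * consensus

def enhance_with_history(attn_row, image_mask, history, boost=0.5):
    """Shift attention mass back toward image tokens using recent steps.

    attn_row: (num_tokens,) current-step attention, sums to 1.
    image_mask: boolean (num_tokens,), True at image-token positions.
    history: list or deque of earlier image-attention vectors.
    boost: strength of the reinforcement (hypothetical knob).
    """
    row = np.asarray(attn_row, dtype=float).copy()
    if len(history) == 0:
        return row
    past = np.mean(list(history), axis=0)
    row[image_mask] += boost * past
    return row / row.sum()  # renormalise so the row is a distribution again

# Usage: keep the last 8 steps of image attention and reinforce the current step.
history = deque(maxlen=8)
mask = np.array([True, True, False])
history.append(np.array([0.3, 0.3]))
step = enhance_with_history(np.array([0.1, 0.1, 0.8]), mask, history)
print(step, step[mask].sum())
```

Both functions only rescale existing attention weights, which is what makes this kind of intervention training-free; whether the real method operates before or after the softmax is not stated on this page.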

If this is right

Correcting attention distraction at inference time reduces object hallucinations on multiple benchmarks and models without retraining.
Enriching cross-head attention improves spatial consistency and visual grounding.
Dynamic historical attention enhancement counters temporal fading of focus on image tokens.
Attention dispersion raises model complexity and lowers generalization performance on classification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If attention distraction proves causal, monitoring attention metrics during decoding could serve as an early detector for hallucination risk.
The method's success at inference time suggests similar attention-modulation interventions could address other perceptual failures in multimodal models.
The parallel drawn to human divided attention may encourage experiments that test whether human-like perceptual interventions transfer to model behavior.
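The first extension above, using an attention metric as an early hallucination detector, is usually evaluated by how well the metric ranks hallucinated tokens above correct ones. The sketch below computes the area under the ROC curve from scratch with the rank-sum identity; the scores and labels are whatever per-token metric and annotation a reader has available, and nothing here reproduces the paper's own detection results.

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney pair-counting identity.

    scores: per-token risk scores (e.g. an attention-dispersion measure).
    labels: 1/True for hallucinated tokens, 0/False otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A value near 0.5 would mean the metric carries no ranking signal; the pairwise form is quadratic in the number of tokens, which is fine for a sketch but not for large evaluation sets.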

Load-bearing premise

The observed patterns of attention distraction are the direct cause of hallucinations rather than a correlated symptom, and fixing them at inference time will reduce hallucinations in untested settings.

What would settle it

A controlled test in which attention distraction is induced while measuring whether hallucination rates rise, or in which distraction is prevented while checking whether hallucinations remain unchanged.
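One way to score such a controlled test is a permutation test on per-response hallucination indicators from the two conditions (for example, baseline decoding versus decoding with induced distraction). The sketch below assumes each response has already been labelled 0 or 1; the function name and the toy counts are illustrative, not data from the paper.

```python
import numpy as np

def permutation_test_rate_diff(control, treated, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in hallucination rates.

    control, treated: 0/1 arrays, 1 = response contained a hallucinated object.
    Returns (observed_difference, p_value).
    """
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treated = np.asarray(treated, dtype=float)
    observed = treated.mean() - control.mean()
    pooled = np.concatenate([control, treated])
    n_control = len(control)
    extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[n_control:].mean() - perm[:n_control].mean()
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)

# Toy example with made-up counts: 10/100 vs. 40/100 hallucinated responses.
control = np.array([1] * 10 + [0] * 90)
treated = np.array([1] * 40 + [0] * 60)
print(permutation_test_rate_diff(control, treated, n_perm=2000))
```

A significant difference under induced distraction would support the causal reading; no difference when distraction is prevented would count against it.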

Figures

Figures reproduced from arXiv: 2605.24602 by Chenping Hou, Quanjiang Li, Tingjin Luo, Wei Luo, Zhiming Liu.

**Figure 1.** Motivation illustration and overview of AFIP.

**Figure 2.** Correspondence between spatial attention inconsistency and object hallucination. (a) Attention maps for the correct token “computer” and the hallucinated token “potted plant”. Attention linked to the hallucinated token is broadly dispersed across the image, whereas attention for the correct token remains sharply concentrated on the target object. Additional examples are provided in Appendix D.6. (b) Distrib…

**Figure 3.** Temporal fading of visual attention. (Left) VAR decreases as generation proceeds across multiple MLLMs, indicating progressive weakening of visual grounding in long-form responses. The shaded area represents the standard deviation band, evaluated on the COCO dataset. (Right) Heatmap shows VAR over normalized token position, with red boxes denoting hallucinated tokens.

**Figure 5.** Hallucination occurrence with different attention-based measures. The ROC (Receiver Operating Characteristic) curves of two models are presented for hallucination detection. In the accompanying text, Wt denotes the size of the history window, fixed at 8 to constitute a sufficiently extended temporal context; the historical text token exerting the greatest influence on the decoding process is then identified by r (l,h…

**Figure 6.** Parameter sensitivity analysis of k and τ.

**Figure 7.** Impact of dynamic historical attention enhancement on visual attention retention during long-form generation. The left panel plots VAR on LLaVA-1.5-7B as a function of token position and compares standard forward propagation with the enhanced variant. Throughout the decoding process, the enhancement mechanism consistently yields higher VAR values, indicating that the model maintain…

**Figure 8.** Case study of temporal fading of visual attention. The adjacent Appendix D.5 text reports that AFIP is not sensitive to α and γ, and that adjusting them does not lead to significant improvements in performance.

**Figure 9.** Parameter sensitivity analysis of α and γ.

**Figure 10.** Visualization of attention maps over the image for real and hallucinated object tokens.

**Figure 11.** Visualization of attention maps over the image for real and hallucinated object tokens.

**Figure 12.** Visualization of attention maps over the image for real and hallucinated object tokens.

**Figure 13.** Qualitative comparison of visual attention maps before and after applying head-level attention distraction correction.

**Figure 14.** Qualitative comparison of visual attention maps before and after applying head-level attention distraction correction.

**Figure 15.** Qualitative results of hallucination mitigation on LLaVA-1.5-7B.

**Figure 16.** Qualitative results of hallucination mitigation on LLaVA-1.5-13B.

**Figure 17.** Qualitative results of hallucination mitigation on MiniGPT-4.

**Figure 18.** Qualitative results of hallucination mitigation on Shikra.

**Figure 19.** Qualitative results of hallucination mitigation on Qwen-VL.

Original abstract

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ties hallucinations to attention distraction in MLLMs and gives an inference-time fix via cross-head and historical attention tweaks, but the causal evidence stays associative.

The main thing to know is that this work treats object hallucinations as tied to attention distraction—spatial inconsistency across heads plus fading focus on image tokens over time—and introduces AFIP to correct those patterns at inference without retraining.

What is new is the concrete combination of cross-head attention enrichment and dynamic historical attention enhancement, plus the claim that dispersion raises complexity and hurts generalization. The experiments test this on multiple benchmarks and models and report gains, which is useful for anyone who wants a lightweight patch.

The practical side holds up reasonably: no extra training is required, and the method is straightforward to implement. That makes it worth a look for applied reliability work.

The softer spot is the causal link. The abstract and framing show correlation between the attention patterns and hallucinations, along with the complexity theory, but there is no clear interventional test (targeted ablations or controlled perturbations) that would show the attention fixes are what actually reduce hallucinations rather than a side effect. If the theory section stays high-level without tight derivations, that part stays provisional.

This is for readers focused on attention diagnostics or quick fixes in vision-language models. It is not a broad theory paper. A serious referee should see it because the empirical results are there and the idea is testable, even if the causal story needs more work.

Referee Report

2 major / 1 minor

Summary. The paper claims that object hallucinations in multimodal LLMs arise from an attention distraction mechanism analogous to human visual blur under divided focus, manifesting as spatial inconsistency across multi-head attention maps and temporal decay of attention to image tokens during decoding. It supplies a theoretical argument that attention dispersion increases model complexity and harms generalization, and introduces the training-free AFIP method that applies cross-head attention enrichment plus dynamic historical attention boosting to restore focused visual grounding, with empirical validation across benchmarks and models.

Significance. If the proposed causal link holds and the interventions act specifically by restoring the identified attention statistics, the work would supply a lightweight, inference-only technique for improving visual grounding in MLLMs. The theoretical component on dispersion and complexity would be a useful contribution if it yields a concrete, testable relation between attention statistics and hallucination rates.

major comments (2)

[§4] (Theory): the argument that attention dispersion increases complexity and degrades generalization is stated as supporting the hallucination-reduction claim, yet no explicit derivation, bound, or equation is supplied that connects the dispersion metric to either complexity or to object-level hallucination rates; without this link the theory does not establish that correcting the two attention statistics will causally reduce hallucinations.
[§5] (Experiments): the reported gains from AFIP are consistent with association but do not include an interventional control (e.g., random perturbation of the same attention statistics or a counterfactual attention map) that would isolate whether the reduction in hallucinations is produced by restoring spatial consistency and temporal persistence rather than by incidental side effects of the enrichment operations.

minor comments (1)

[Abstract] The abstract and introduction use “strongly associated” and “manifests as” interchangeably with causal language; a single clarifying sentence distinguishing correlation from the claimed mechanism would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the theoretical linkage and experimental controls. We respond to each major comment below.

Referee: [§4] (Theory): the argument that attention dispersion increases complexity and degrades generalization is stated as supporting the hallucination-reduction claim, yet no explicit derivation, bound, or equation is supplied that connects the dispersion metric to either complexity or to object-level hallucination rates; without this link the theory does not establish that correcting the two attention statistics will causally reduce hallucinations.

Authors: We agree that an explicit derivation would strengthen the claim. The manuscript currently offers high-level insights linking dispersion to complexity and generalization; in revision we will add a formal connection, e.g., by expressing dispersion via attention entropy and deriving a PAC-style generalization bound that also correlates with object hallucination rates.

Revision: yes

Referee: [§5] (Experiments): the reported gains from AFIP are consistent with association but do not include an interventional control (e.g., random perturbation of the same attention statistics or a counterfactual attention map) that would isolate whether the reduction in hallucinations is produced by restoring spatial consistency and temporal persistence rather than by incidental side effects of the enrichment operations.

Authors: We acknowledge the value of stronger causal isolation. While current results show consistent gains, we will add an interventional ablation in the revision that applies controlled random perturbations to the same attention statistics and compares hallucination outcomes, thereby testing whether restoration of the identified statistics is the operative mechanism.

Revision: yes
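The interventional control discussed in this exchange needs a perturbation that changes attention by the same amount as the correction but in an uninformative direction. The sketch below is one hypothetical way to build such a control: it mixes the original attention with a random distribution until the L1 change matches that of the corrected map. The function name and the Dirichlet choice for the random target are assumptions, not something described in the paper or the rebuttal.

```python
import numpy as np

def matched_random_perturbation(attn, corrected, rng):
    """Perturb `attn` in a random direction by (at most) the correction's L1 size.

    attn, corrected: 1-D attention distributions (non-negative, sum to 1).
    Returns a valid distribution whose L1 distance from `attn` equals
    ||corrected - attn||_1 whenever the random target lies far enough away,
    and is smaller otherwise.
    """
    attn = np.asarray(attn, dtype=float)
    corrected = np.asarray(corrected, dtype=float)
    budget = np.abs(corrected - attn).sum()
    target = rng.dirichlet(np.ones_like(attn))  # random point on the simplex
    gap = np.abs(target - attn).sum()
    if gap == 0.0:
        return attn.copy()
    lam = min(1.0, budget / gap)
    # Convex mixing keeps the result non-negative and summing to 1.
    return (1.0 - lam) * attn + lam * target

rng = np.random.default_rng(0)
attn = np.full(4, 0.25)
corrected = np.array([0.4, 0.3, 0.2, 0.1])
control = matched_random_perturbation(attn, corrected, rng)
print(control, np.abs(control - attn).sum())
```

If hallucination rates fall under the real correction but not under this size-matched control, the improvement is less likely to be a side effect of merely disturbing the attention weights.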

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained

The abstract and description present an observational association between attention distraction patterns and hallucinations, followed by independent theoretical claims about dispersion increasing complexity, then a motivated algorithmic intervention. No equations, fitted parameters renamed as predictions, or self-citations are exhibited that reduce the central claims to their own inputs by construction. The proposed AFIP method is described as correcting observed patterns without evidence that its justification loops back definitionally to the same attention maps used to define the problem. This is the common case of a paper whose core argument does not collapse into tautology or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on an unstated assumption that attention distraction is measurable and causal.

Reference graph

Works this paper leans on

16 extracted references · 14 canonical work pages · 8 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2...

[3]

Language Models are Few-Shot Learners

URL https://arxiv.org/abs/2110.01705. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NIPS,

[4]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195,

[5]

HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pp. 24185–24198, 2024a. Chen, Z., Zhao, Z., Luo, H., Yao, H., Li, B., and Zhou, J. HALC: Object hallucination reduction via adaptive focal-contrast d...

[6]

Koroteev, M. V. BERT: a review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943,

[7]

MedGPT: Medical Concept Prediction from Clinical Narratives

Kraljevic, Z., Shek, A., Bean, D., Bendayan, R., Teo, J., and Dobson, R. MedGPT: Medical concept prediction from clinical narratives. arXiv preprint arXiv:2107.03134,

[8]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z., and Li, C. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895,

[9]

Evaluating Object Hallucination in Large Vision-Language Models

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355,

[10]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565,

[11]

DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding

Wang, D., Raman, N., Sibue, M., Ma, Z., Babkin, P., Kaur, S., Pei, Y., Nourbakhsh, A., and Liu, X. DocLLM: A layout-aware generative language model for multimodal document understanding. In ACL, 2024a. Wang, J., Zhou, Y., Xu, G., Shi, P., Zhao, C., Xu, H., Ye, Q., Yan, M., Zhang, J., Zhu, J., et al. Evaluation and analysis of hallucination in large visio...

[12]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024b. Wen, W., Gong, T., Dong, Y., Yu, S., and Zhang, W. Towards the generalization of multi-view learning: An information-theoretical...

[13]

Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models

Wu, J., Liu, Q., Wang, D., Zhang, J., Wu, S., Wang, L., and Tan, T. Logical closed loop: Uncovering object hallucinations in large vision-language models. arXiv preprint arXiv:2402.11622,

[14]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, B., Niu, Y., Lee, S., Hur, M., and Zhang, H. Debiased fine-tuning for vision-language models by prompt regularization. In AAAI, 2023a. Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023b. Zou, X., Wang, Y., Yan, Y., Lyu...

[15]

$\left[\sum_{i=1}^{n} \|v_i\|^2\right]^{1/2} \le$

A. Proof of Theorem 5.1. We begin the proof of Theorem 5.1 by introducing the following lemmas. Lemma A.1 (Edelman et al., 2022). For vectors $\theta_1, \theta_2 \in \mathbb{R}^p$, we have $\|\mathrm{softmax}(\theta_1) - \mathrm{softmax}(\theta_2)\|_1 \le 2\,\|\theta_1 - \theta_2\|_\infty$. (15) Lemma A.2 (Lust-Piquard & Pisier, 1...
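The softmax inequality quoted as Lemma A.1 is easy to sanity-check numerically. The sketch below draws random pairs of logit vectors and reports the largest observed ratio between the L1 distance of the softmax outputs and the sup-norm distance of the inputs; the lemma says this ratio never exceeds 2. The sampling scheme and function names are arbitrary choices for illustration, and a random search is evidence, not a proof.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D vector."""
    x = np.asarray(x, dtype=float)
    z = np.exp(x - x.max())
    return z / z.sum()

def worst_softmax_ratio(p=16, trials=1000, seed=0):
    """Largest observed ||softmax(a) - softmax(b)||_1 / ||a - b||_inf."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        a = rng.normal(size=p)
        # Perturbations of varying size, from tiny to large.
        b = a + rng.normal(scale=rng.uniform(0.01, 2.0), size=p)
        lhs = np.abs(softmax(a) - softmax(b)).sum()
        rhs = np.abs(a - b).max()
        worst = max(worst, lhs / rhs)
    return worst

print(worst_softmax_ratio())
```

In attention terms, the lemma bounds how far an attention distribution can move when its pre-softmax scores are perturbed, which is the kind of stability a complexity argument about attention needs.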
[16]

Please describe this image in detail

Then, for any $\epsilon > 0$, the following inequality holds: $P\left(\sum_{i=1}^{k} \bar{a}_i \left(p_i - \frac{X_i}{m}\right) > \epsilon\right) \le \exp\left(-\frac{m\epsilon^2}{\beta}\right)$, where $\beta = 2\sum_{i=1}^{k} \bar{a}_i^2 p_i$. Lemma B.2. For any $y \in \mathcal{Y}$, if the loss function $l(\cdot, y)$ is $L_l$-Lipschitz, the following inequality holds: $|l(u, y) - l(v, y)| \le L_l \|u - v\|_2, \ \forall u, v \in \mathbb{R}$. (41) Lemma B.3. If the function $\psi$ is $L_l$-Lipschitz with respect to the Euclidean norm $\|\cdot\|_2$, the ...
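The concentration inequality in the excerpt above can be probed with a quick Monte Carlo run. The sketch assumes the garbled statement reads as a one-sided bound for multinomial counts, P(Σ ā_i (p_i − X_i/m) > ε) ≤ exp(−mε²/β) with β = 2 Σ ā_i² p_i, where (X_1, …, X_k) is multinomial with m trials and probabilities p. That reading, the function name, and the toy parameters are all assumptions.

```python
import numpy as np

def multinomial_tail_check(p, a_bar, m, eps, trials=20000, seed=0):
    """Compare an empirical tail probability with the assumed exponential bound.

    Returns (empirical_probability, bound).
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    a_bar = np.asarray(a_bar, dtype=float)
    counts = rng.multinomial(m, p, size=trials)           # shape (trials, k)
    deviation = (a_bar * (p - counts / m)).sum(axis=1)    # one value per trial
    empirical = float((deviation > eps).mean())
    beta = 2.0 * (a_bar ** 2 * p).sum()
    bound = float(np.exp(-m * eps ** 2 / beta))
    return empirical, bound

empirical, bound = multinomial_tail_check(
    p=[0.2, 0.3, 0.5], a_bar=[1.0, 2.0, 3.0], m=100, eps=0.2
)
print(empirical, bound)
```

With these toy values the bound is loose, so the check mainly confirms that the assumed reading is not contradicted by simulation.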
