Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 05:35 UTC · model grok-4.3
The pith
Identifying action-sensitive attention heads and boosting focus on relevant image regions reduces action-relation hallucinations in large vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions.
What carries the argument
The Action-Relation Sensitivity (ARS) score, which ranks attention heads by sensitivity to action-relation changes, paired with Relation-aware Visual Enhancement (RVE) that raises attention weights on the identified action-relevant image regions.
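To make the machinery concrete, here is a minimal sketch of the ARS ranking step in PyTorch, following the quoted definition ARS(l,h) = ∥A − Â∥_F / ½(∥A∥_F + ∥Â∥_F). Tensor shapes, the layer/head counts, and all function names are illustrative assumptions, not the paper's released code.

```python
import torch

def ars_scores(attn_orig: torch.Tensor, attn_pert: torch.Tensor) -> torch.Tensor:
    """Action-Relation Sensitivity per (layer, head).

    attn_orig: attention maps for the original query, shape [L, H, Q, K]
    attn_pert: attention maps after the action-relation perturbation, same shape
    Returns an [L, H] tensor of normalized Frobenius distances.
    """
    diff = torch.linalg.matrix_norm(attn_orig - attn_pert)    # ||A - Â||_F per head
    scale = 0.5 * (torch.linalg.matrix_norm(attn_orig)
                   + torch.linalg.matrix_norm(attn_pert))     # ½(||A||_F + ||Â||_F)
    return diff / scale.clamp_min(1e-8)                       # guard against empty maps

# Toy example: 32 layers x 32 heads over a 64-token attention map.
scores = ars_scores(torch.rand(32, 32, 64, 64), torch.rand(32, 32, 64, 64))
flat = scores.flatten().argsort(descending=True)
top_heads = [(int(i) // scores.shape[1], int(i) % scores.shape[1]) for i in flat[:8]]
print(top_heads)  # (layer, head) pairs most sensitive to action-relation changes
```

The heads at the top of this ranking are the ones whose attention maps are then read off to localize the action-relevant image regions that RVE boosts.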
If this is right
- LVLMs produce more accurate descriptions of object interactions at almost no extra inference cost.
- The same attention-enhancement technique generalizes to reducing spatial-relation and object hallucinations.
- Targeted boosting of relation-sensitive heads offers a lightweight way to improve visual grounding without retraining the full model.
Where Pith is reading between the lines
- The ARS scoring approach could be adapted to detect and correct other hallucination categories by swapping the sensitivity criterion.
- Running the enhancement at training time instead of inference might produce models that require less post-hoc correction for relation errors.
- Extending the method to video or multi-image inputs could address hallucinations involving temporal or cross-image relations.
Load-bearing premise
Insufficient attention to visual information is the main driver of action-relation hallucinations rather than other factors such as training data or model capacity.
What would settle it
A controlled test that applies the ARS-based region identification and attention boost and measures action-relation hallucination rates against the unmodified baseline: no reduction would falsify the premise, while a consistent reduction under matched conditions would support it.
Original abstract
Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that action-relation hallucinations in LVLMs arise primarily from insufficient attention to visual tokens. It introduces an Action-Relation Sensitivity (ARS) score to identify attention heads sensitive to action-relation changes and localize relevant image regions, followed by a Relation-aware Visual Enhancement (RVE) procedure that boosts attention to those regions during inference. Experiments are said to show that the resulting method outperforms existing baselines on action-relation hallucination benchmarks while adding negligible inference cost and generalizing to spatial-relation and object hallucinations.
Significance. If the motivating observation and reported gains hold under controlled conditions, the work would supply a lightweight, head-localized intervention for a class of hallucinations that most prior mitigation techniques have not addressed, offering a practical route to improved relational reasoning in deployed LVLMs.
major comments (2)
- [Abstract / §1] The central premise that 'the primary cause of action-relation hallucinations ... is the insufficient attention allocated to visual information' is stated as an empirical observation, yet no LVLMs, datasets, prompt templates, attention-measurement protocol, or control conditions are specified. Because this observation directly motivates both the ARS head-selection step and the RVE enhancement, its lack of documentation renders the entire pipeline's rationale unverifiable.
- [§3, Method] The definitions of the ARS score and the subsequent region-localization procedure are introduced without an explicit statement of how ARS thresholds or head-selection criteria are chosen; if these choices are tuned on the same evaluation sets used to report gains, the performance improvements may be partly circular.
minor comments (2)
- [§4, Experiments] Report the exact number of runs, random seeds, and statistical significance tests (e.g., paired t-tests or Wilcoxon) for all claimed improvements over baselines; a minimal example of such a test follows this list.
- [Figure 2 / §3.2] Clarify the precise mathematical formulation of RVE (including any scaling factors or masking operations) so that the negligible-inference-cost claim can be reproduced from the text alone.
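To make the first request concrete, a minimal sketch of the paired tests named above; the per-example scores are invented placeholders, not results from the paper, and only the test procedure itself is the point.

```python
# Hypothetical paired comparison of per-example accuracy for the baseline
# decoder vs. the RVE-enhanced decoder. All numbers are invented.
from scipy.stats import ttest_rel, wilcoxon

baseline = [0.62, 0.58, 0.71, 0.66, 0.60, 0.64, 0.59, 0.68]
enhanced = [0.69, 0.63, 0.74, 0.70, 0.61, 0.70, 0.65, 0.72]

print(ttest_rel(enhanced, baseline))   # paired t-test over matched examples
print(wilcoxon(enhanced, baseline))    # distribution-free alternative
```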
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The two major comments highlight important issues of verifiability and potential circularity in our empirical motivation and method design. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Abstract / §1] The central premise that 'the primary cause of action-relation hallucinations ... is the insufficient attention allocated to visual information' is stated as an empirical observation, yet no LVLMs, datasets, prompt templates, attention-measurement protocol, or control conditions are specified. Because this observation directly motivates both the ARS head-selection step and the RVE enhancement, its lack of documentation renders the entire pipeline's rationale unverifiable.
  Authors: We agree that the empirical foundation for the central premise requires explicit documentation to be verifiable. The observation was derived from internal preliminary experiments (attention-weight comparisons on action-relation vs. non-relation prompts), but these details were omitted from the submitted version for brevity. In the revised manuscript we will insert a new paragraph in §1 (and a supporting subsection in the appendix) that specifies: the LVLMs examined (LLaVA-1.5-7B/13B, InstructBLIP, Qwen-VL), the prompt templates and datasets used for the observation (subsets of Visual Genome and a custom action-relation prompt set), the exact attention-measurement protocol (mean attention mass on visual tokens for perturbed vs. original queries), and the control conditions (hallucination rate under visual masking). This addition will make the motivating claim fully reproducible while preserving the original narrative flow. Revision: yes.
- Referee: [§3, Method] The definitions of the ARS score and the subsequent region-localization procedure are introduced without an explicit statement of how ARS thresholds or head-selection criteria are chosen; if these choices are tuned on the same evaluation sets used to report gains, the performance improvements may be partly circular.
  Authors: We acknowledge the need for explicit, non-circular selection rules. The ARS score (Eq. 1) ranks heads by the absolute change in visual-token attention under action-relation perturbation; we select the top 20% of heads by this ranking (a fixed percentile chosen once on a small development split of Visual Genome that is disjoint from all reported test benchmarks). Region localization thresholds the resulting attention map at the 75th percentile, again a fixed value validated only on the same held-out development split. In the revised §3.2–3.3 we will state these criteria verbatim, add a sentence confirming that no hyper-parameter search was performed on the action-relation hallucination test sets, and include a brief sensitivity table (varying the percentile from 60–85%) to demonstrate robustness. These changes eliminate any appearance of circularity. Revision: yes.
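Taking the rebuttal's numbers at face value, the selection-and-enhancement loop reduces to three small steps. The sketch below uses the stated top-20% head cutoff and 75th-percentile region threshold; the multiplicative boost with renormalization is our assumption about RVE's form, since the exact formulation is deferred to the revised §3.2.

```python
import torch

def select_heads(ars: torch.Tensor, keep_frac: float = 0.20) -> torch.Tensor:
    """Boolean mask [L, H]: keep the top `keep_frac` of heads by ARS score."""
    cutoff = torch.quantile(ars.flatten(), 1.0 - keep_frac)
    return ars >= cutoff

def region_mask(head_attn_on_visual: torch.Tensor, pct: float = 0.75) -> torch.Tensor:
    """Boolean mask over visual tokens from the selected heads' attention.

    head_attn_on_visual: [num_selected_heads, num_visual_tokens]
    Tokens above the 75th percentile of the averaged map count as action-relevant.
    """
    avg = head_attn_on_visual.mean(dim=0)
    return avg >= torch.quantile(avg, pct)

def boost_row(attn_row: torch.Tensor, mask: torch.Tensor, alpha: float = 1.5) -> torch.Tensor:
    """Hypothetical RVE step: upweight attention on the masked visual tokens,
    then renormalize so the row remains a probability distribution."""
    boosted = torch.where(mask, attn_row * alpha, attn_row)
    return boosted / boosted.sum()
```

Because both percentiles are fixed once on a held-out development split, nothing in this loop touches the evaluation benchmarks, which is exactly the non-circularity property the referee asked the authors to make explicit.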
Circularity Check
No circularity: the derivation chain is self-contained, and its conclusions are not built into its inputs
full rationale
The paper motivates its framework from an empirical observation on attention allocation in LVLMs, then defines ARS as a new sensitivity metric over attention heads and RVE as an attention-enhancement procedure applied to localized regions. These steps are constructive definitions rather than reductions; the reported performance improvements are presented as outcomes of experiments on external benchmarks, with no equations or procedures shown to be equivalent to the inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the load-bearing chain. The method therefore remains non-circular and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Action-relation hallucinations primarily stem from insufficient attention to visual information
invented entities (2)
- Action-Relation Sensitivity (ARS) score: no independent evidence
- Relation-aware Visual Enhancement (RVE): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "define the Action-Relation Sensitivity (ARS) score ... ARS(l,h) = ∥A(l,h) − Â(l,h)∥_F / ½(∥A∥_F + ∥Â∥_F)"
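For readability, the quoted score in conventional notation; placing the (l,h) indices on the denominator norms is an inference from context, since the extracted line drops them:

```latex
\mathrm{ARS}(l,h) \;=\;
\frac{\bigl\lVert A^{(l,h)} - \hat{A}^{(l,h)} \bigr\rVert_F}
     {\tfrac{1}{2}\Bigl(\bigl\lVert A^{(l,h)} \bigr\rVert_F + \bigl\lVert \hat{A}^{(l,h)} \bigr\rVert_F\Bigr)}
```

Here A^(l,h) and Â^(l,h) are head h of layer l's attention maps for the original and action-perturbed inputs, so the score lies in [0, 2] and is 0 when the perturbation leaves the head's attention unchanged.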
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195.
- [2] A Survey on Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2402.00253.
- [3] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. International Conference on Machine Learning, 2023.
- [4] DeepSeek-VL: Towards Real-World Vision-Language Understanding. 2024.
- [5] Chen, Zhe; Wu, Jiannan; Wang, Wenhai; Su, Weijie; Chen, Guo; Xing, Sen; Zhong, Muyan; Zhang, Qinglong; Zhu, Xizhou; Lu, Lewei; Li, Bin; Luo, Ping; Lu, Tong; Qiao, Yu; Dai, Jifeng. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [6] Unified Hallucination Detection for Multimodal Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [7] THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [8] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv preprint arXiv:2304.14178.
- [9] Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
- [10] Evaluation and Analysis of Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2308.15126.
- [11] Detecting and Preventing Hallucinations in Large Vision Language Models. Proceedings of the AAAI Conference on Artificial Intelligence.
- [12] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models. 12th International Conference on Learning Representations (ICLR), 2024.
- [13] Yu, Tianyu; Yao, Yuan; Zhang, Haoye; He, Taiwen; Han, Yifeng; Cui, Ganqu; Hu, Jinyi; Liu, Zhiyuan; Zheng, Hai-Tao; Sun, Maosong; Chua, Tat-Seng. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [14] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [15] Yu, Qifan; Li, Juncheng; Wei, Longhui; Pang, Liang; Ye, Wentao; Qin, Bosheng; Tang, Siliang; Tian, Qi; Zhuang, Yueting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [16] Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [17] ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models. arXiv preprint arXiv:2507.00898.
- [18] Beyond Hallucinations: Enhancing LVLMs through Hallucination-aware Direct Preference Optimization. arXiv preprint arXiv:2311.16839.
- [19] Sun, Zhiqing; Shen, Sheng; Cao, Shengcao; Liu, Haotian; Li, Chunyuan; Shen, Yikang; Gan, Chuang; Gui, Liangyan; Wang, Yu-Xiong; Yang, Yiming; Keutzer, Kurt; Darrell, Trevor. Aligning Large Multimodal Models with Factually Augmented RLHF. Findings of the Association for Computational Linguistics: ACL 2024, 2024. doi:10.1865...
- [20] Woodpecker: Hallucination Correction for Multimodal Large Language Models. Science China Information Sciences, 2024.
- [21] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [22] HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding. Proceedings of the 41st International Conference on Machine Learning.
- [23] Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention. Proceedings of the Computer Vision and Pattern Recognition Conference.
- [24] Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding. Findings of the Association for Computational Linguistics: ACL 2024.
- [25] ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models. Proceedings of the Computer Vision and Pattern Recognition Conference.
- [26] Do Vision & Language Decoders Use Images and Text Equally? How Self-consistent Are Their Explanations? arXiv preprint arXiv:2404.18624.
- [27] Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [28] Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas.
- [29] See What You Are Told: Visual Attention Sink in Large Multimodal Models.
- [30] Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens. Proceedings of the Computer Vision and Pattern Recognition Conference.
- [31] Hallucination of Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2404.18930.
- [32] The Revolution of Multimodal Large Language Models: A Survey. Findings of the Association for Computational Linguistics: ACL 2024.
- [33] A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges. Proceedings of the Computer Vision and Pattern Recognition Conference.
- [34] Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models. International Conference on Machine Learning, 2024.
- [35] MMRel: A Relation Understanding Benchmark in the MLLM Era. arXiv preprint arXiv:2406.09121.
- [36] Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025.
- [37] Head Pursuit: Probing Attention Specialization in Multimodal Transformers.
- [38] Retrieval Head Mechanistically Explains Long-Context Factuality. 13th International Conference on Learning Representations (ICLR), 2025.
- [39] Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey. Do Vision Transformers See Like Convolutional Neural Networks?
- [40] Layer by Layer: Uncovering Hidden Representations in Language Models. arXiv preprint arXiv:2502.02013.
- [41] InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning. Advances in Neural Information Processing Systems.
- [42] Object Hallucination in Image Captioning. arXiv preprint arXiv:1809.02156.
- [43] Improved Baselines with Visual Instruction Tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [44] Liu, Haotian; Li, Chunyuan; Li, Yuheng; Li, Bo; Zhang, Yuanhan; Shen, Sheng; Lee, Yong Jae.
- [45] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. European Conference on Computer Vision, 2024.
- [46] AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. 2024.
- [47] Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs. European Conference on Computer Vision, 2024.
- [48] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.
- [49] MMBench: Is Your Multi-Modal Model an All-Around Player? European Conference on Computer Vision, 2024.
- [50] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.