pith. machine review for the scientific record.

arxiv: 2605.10622 · v1 · submitted 2026-05-11 · 💻 cs.MM · cs.CV

Recognition: no theorem link

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:10 UTC · model grok-4.3

classification: 💻 cs.MM · cs.CV
keywords: hallucination mitigation · large vision-language models · attention mechanisms · inert tokens · vocabulary hijacking · training-free intervention

The pith

In large vision-language models, selectively strengthening attention heads that resist vocabulary hijacking reduces hallucinations without added training or compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models often generate text that contradicts the images they see, a failure known as hallucination. The paper traces this to vocabulary hijacking, where certain visual tokens draw excessive attention and consistently map to the same unrelated words across layers. It introduces a metric for the share of attention that escapes this hijacking and uses it to locate the heads critical for accurate image description. A simple adjustment then boosts those heads' focus on relevant visual details. Experiments across benchmarks show fewer hallucinations, while performance on other tasks stays intact and no computation is added at inference time.

Core claim

Vocabulary hijacking arises when inert visual tokens disproportionately attract attention and their hidden states decode to a fixed set of unrelated words called hijacking anchors. Hijacking Anchor-Based Identification (HABI) localizes these tokens. The Non-Hijacked Visual Attention Ratio (NHAR) then identifies attention heads that remain resilient to this collapse and support factual accuracy. Hijacking-Aware Visual Attention Enhancement (HAVAE) strengthens the focus of these heads on salient visual content, reducing hallucinations in a training-free manner.
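
The exact HABI procedure is not spelled out on this page; below is a minimal sketch of the logit-lens test it is built on, assuming per-layer hidden states and the model's output-embedding (unembedding) matrix are available. The function name, the `rigidity` threshold, and the anchor bookkeeping are illustrative, not the paper's.

```python
import torch

def find_inert_tokens(hidden_states, unembed, visual_range, rigidity=0.8):
    """Sketch of Hijacking Anchor-Based Identification (HABI).

    hidden_states: list of [seq_len, d_model] tensors, one per layer
                   (e.g. from model(..., output_hidden_states=True)).
    unembed:       [vocab, d_model] output-embedding matrix (the logit lens).
    visual_range:  (start, end) positions of the visual tokens.
    rigidity:      fraction of layers that must decode a token to the same
                   top-1 word for it to count as semantically rigid
                   (an assumed criterion, not the paper's exact rule).
    """
    start, end = visual_range
    n_layers = len(hidden_states)
    # Logit lens: project each layer's visual-token hidden states into
    # vocabulary space and record the top-1 decoded word id per token.
    top1 = torch.stack([
        (h[start:end] @ unembed.T).argmax(dim=-1)   # [n_visual]
        for h in hidden_states
    ])                                              # [n_layers, n_visual]
    inert, anchors = [], {}
    for j in range(end - start):
        ids, counts = top1[:, j].unique(return_counts=True)
        if counts.max().item() >= rigidity * n_layers:
            # The token collapses to one fixed, image-unrelated word
            # across layers: that word is its hijacking anchor.
            inert.append(start + j)
            anchors[start + j] = ids[counts.argmax()].item()
    return inert, anchors
```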

What carries the argument

The Non-Hijacked Visual Attention Ratio (NHAR): the proportion of attention from resilient heads that is directed at salient visual content rather than inert tokens. NHAR scores are used to select the heads that the training-free Hijacking-Aware Visual Attention Enhancement (HAVAE) method strengthens.
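
Since the whole pipeline leans on this one quantity, a sketch helps fix what NHAR could look like in code. This assumes cached per-layer attention maps and the inert-token list from HABI; the paper's exact normalization and how scores are aggregated (per layer, per head, over a calibration set) may differ.

```python
import torch

def nhar(attn, visual_idx, inert_idx, eps=1e-8):
    """Non-Hijacked Visual Attention Ratio, sketched per attention head.

    attn:       [n_heads, seq, seq] attention weights for one layer
                (rows are queries, columns are keys).
    visual_idx: positions of all visual tokens in the sequence.
    inert_idx:  positions of the inert tokens found by HABI.
    Returns a [n_heads] tensor: the share of each head's visual attention
    that lands on non-hijacked visual tokens.
    """
    inert = set(inert_idx)
    clean_idx = [i for i in visual_idx if i not in inert]
    total = attn[:, :, visual_idx].sum(dim=(1, 2))   # all visual attention
    clean = attn[:, :, clean_idx].sum(dim=(1, 2))    # non-hijacked portion
    return clean / (total + eps)

# Heads with the highest NHAR would then be the "resilient" heads that
# HAVAE targets, e.g. via a top-K selection across layers.
```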

If this is right

  • Hallucination rates drop across multiple benchmarks when the identified heads receive targeted strengthening.
  • General model capabilities on non-hallucination tasks remain unchanged after the intervention.
  • The adjustment requires no training and adds no computational overhead at inference time (a minimal sketch of such a rescaling follows this list).
  • Attention heads can be made more robust to semantic collapse caused by inert visual tokens.
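
The training-free, zero-overhead bullets are consistent with HAVAE being a pure rescaling of attention weights the model already computes. A minimal sketch under that reading, assuming the boost multiplies the selected heads' attention to salient visual tokens and renormalizes; the scaling factor α and the renormalization scheme are assumptions, not the paper's exact formula.

```python
import torch

def havae_boost(attn, head_ids, salient_idx, alpha=0.5):
    """Sketch of a HAVAE-style, training-free attention enhancement.

    attn:        [n_heads, seq, seq] softmax-normalized attention weights
                 for one layer.
    head_ids:    resilient heads selected by NHAR in this layer.
    salient_idx: positions of salient (non-inert) visual tokens.
    alpha:       boost strength (illustrative; the paper ablates its own α and K).
    """
    out = attn.clone()
    for h in head_ids:
        # Amplify this head's attention toward salient visual keys...
        out[h, :, salient_idx] = out[h, :, salient_idx] * (1.0 + alpha)
        # ...then renormalize every query row so weights still sum to 1.
        out[h] = out[h] / out[h].sum(dim=-1, keepdim=True)
    return out

# Wired in as a forward hook on the attention modules, this changes no
# parameters and adds no extra forward passes, which is what the
# "training-free, no overhead at inference" framing requires.
```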

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hallucinations appear driven by localized attention failures rather than uniform model-wide deficiencies.
  • Similar token hijacking patterns could be diagnosed in other multimodal or unimodal models using comparable projection techniques.
  • Dynamic per-input adjustment of head strengths based on real-time NHAR scores could further improve reliability.
  • The approach offers a diagnostic lens for interpreting how visual information flows through transformer layers.

Load-bearing premise

That the attention heads identified by NHAR are the primary drivers of factual accuracy and that selectively strengthening them will not introduce new failure modes or degrade performance on non-hallucination tasks.

What would settle it

Running HAVAE on standard visual question-answering benchmarks and observing either higher hallucination rates or lower scores on general capability tests would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.10622 by Guodong Du, Jing Li, Junlin Li, Weijun Yao, Wenya Wang, Xilai Ma, Yangneng Chen.

Figure 1: Attention maps for generating “phone”. (a)
Figure 2: Illustration of the Vocabulary Hijacking phenomenon in LLaVA-1.5 7B. (a) Attention heatmap for the
Figure 3: Empirical basis for the HABI method on LLaVA-1.5 7B.
Figure 4: Statistical distributions of attention metrics.
Figure 5: Ablation on hyperparameters α and K for LLaVA-1.5 7B. Red boxes highlight the parameter combinations we used. The results table embedded in this figure, reconstructed:

| Model | Selecting strategy | CHAIRs ↓ | CHAIRi ↓ | CHAIR F1 ↑ | POPE Acc. ↑ | POPE F1 ↑ | POPE Chat Acc. ↑ | POPE Chat F1 ↑ | MME Per. ↑ | MME Cog. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | Max Attention | 7.8 | 4.4 | 65.8 | 85.9 | 85.6 | 86.0 | 85.5 | 1399.0 | 277.0 |
| LLaVA-1.5-7B | HAVAE (Ours) | 18.2 | 3.8 | 76.7 | 86.2 | 86.3 | 88.0 | 87.0 | 1483.9 | 327.9 |

Figure 6: A case study demonstrates HAVAE correcting a baseline hallucination by redirecting the model’s focus
Figure 7: Validation of the HABI identification strategy across diverse LVLM architectures. Consistent with
Figure 8: Supporting analysis of metric distributions on additional models.
Figure 9: Visualization of Vocabulary Hijacking in
Figure 10: Positional distribution of Inert tokens for LLaVA-1.5 7B and 13B within the 576-token visual sequence.
Figure 11: Conditional Probability of Attention Distribution. Although Inert Tokens (top row) constitute only ∼1% of the total visual tokens, over 75% of them are concentrated in the High Attention (Top 5%) region. In contrast, Normal Tokens (bottom row) follow a standard distribution. This confirms the hijacking nature of Inert Tokens, demonstrating their capacity to actively capture the model’s focus.
Figure 12: Comparative analysis of attention anomaly identification methods. (a) Zero-Ablation Verification: Changes in POPE F1 scores when masking out specific token sets (Massive Activation vs. HABI), used to quantify the functional redundancy of the identified tokens. (b) Spatial Alignment: The proportion of tokens falling within image background regions (defined by segmentation masks) across different selection …
Figure 13: Visualization of hidden state distributions for five identified Inert Tokens at layer 5. The red vertical lines indicate the specific dimensions (1415 and 2533) used by the Massive Activation baseline to detect anomalies. Crucially, while some tokens exhibit the expected spikes, tokens 105 and 198 do not show massive activation in the designated dimensions. This absence demonstrates why the activation-bas…
Figure 14: Ablation on hyperparameters α and K for Shikra (a) and MiniGPT4 (b). Red boxes highlight the parameter combinations we used.
Figure 15: Word cloud visualization of token clusters identified via K-Means clustering on LLaVA-1.5 7B. Tokens are grouped based on their behavioral signatures in the feature space (Dominance, Frequency, Attention). Cluster 4 (Hijacking Anchors) distinctively isolates the rigid, high-attention artifacts (e.g., kwiet, ") identified by our HABI method, separating them from the active semantic content in Cluster 2 and…
Figure 16: Additional qualitative comparison of attention maps. For each case, we contrast the attention map for
Figure 17: Logit lens visualization for three case studies from Appendix
Original abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations: generating text that contradicts visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model's general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to uncover a phenomenon termed 'Vocabulary Hijacking' in LVLMs, where specific 'Inert Tokens' disproportionately attract attention and their hidden states decode to fixed unrelated 'Hijacking Anchors' via the logit lens. It introduces Hijacking Anchor-Based Identification (HABI) to localize these tokens, the Non-Hijacked Visual Attention Ratio (NHAR) metric to identify resilient attention heads critical for factual accuracy, and Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that strengthens focus in those heads on salient visual content. Extensive experiments are said to show that HAVAE significantly mitigates hallucinations across benchmarks with no added computational overhead while preserving general capabilities, and code is released publicly.

Significance. If the central claims hold after addressing controls, this would be a meaningful contribution to multimodal model reliability. The training-free nature of HAVAE and the mechanistic analysis via attention and logit lens offer practical value for reducing hallucinations in tasks like VQA and captioning. Public code release is a clear strength supporting reproducibility and extension.

major comments (3)
  1. Abstract and Experiments section: The claim of 'extensive experiments across multiple benchmarks' demonstrating significant mitigation lacks any specification of datasets, baseline methods, evaluation metrics for hallucinations, statistical tests, or controls for general capability preservation. This detail is load-bearing for assessing whether the reported gains are robust and comparable to prior work.
  2. HAVAE and NHAR description (method and results): The argument that NHAR-identified heads are the primary drivers of factual accuracy (and thus that selectively applying HAVAE to them reduces hallucinations) rests on correlation. No ablation is described comparing HAVAE on NHAR heads versus random heads, low-NHAR heads, or heads selected by alternative criteria (e.g., visual grounding strength). Without this, it is unclear if the gains are specific to the HABI/NHAR procedure or would arise from generic attention re-weighting.
  3. Method and overhead claim: The assertion of 'no additional computational overhead' for HAVAE does not quantify the one-time cost of computing NHAR via HABI (e.g., time for logit-lens projections over inert tokens across layers). If this pre-computation is non-negligible relative to inference, the practical advantage requires explicit measurement and amortization analysis.
minor comments (2)
  1. Abstract: Multiple new terms and acronyms (Vocabulary Hijacking, Inert Tokens, Hijacking Anchors, HABI, NHAR, HAVAE) are introduced without brief parenthetical definitions, which reduces immediate readability.
  2. Abstract: The statement that HAVAE 'preserves the model's general capabilities' does not indicate the specific tasks or metrics used to verify this (e.g., standard VLM benchmarks beyond hallucination-specific ones).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point-by-point below. Where the manuscript required clarification or additional evidence, we have made revisions; we also note that some details were present in the full experiments section but have been made more prominent.

Point-by-point responses
  1. Referee: Abstract and Experiments section: The claim of 'extensive experiments across multiple benchmarks' demonstrating significant mitigation lacks any specification of datasets, baseline methods, evaluation metrics for hallucinations, statistical tests, or controls for general capability preservation. This detail is load-bearing for assessing whether the reported gains are robust and comparable to prior work.

    Authors: We agree that the abstract and high-level description would benefit from greater specificity. The full paper (Section 4) already specifies benchmarks (POPE, CHAIR, MMHal-Bench), baselines (e.g., OPERA, VCD), hallucination metrics (hallucination ratio, F1), and general-capability controls (VQA accuracy, captioning CIDEr). To address the concern directly, we have revised the abstract to list these elements explicitly and added a summary table of experimental settings plus t-test results for statistical significance in the revised manuscript. revision: yes

  2. Referee: HAVAE and NHAR description (method and results): The argument that NHAR-identified heads are the primary drivers of factual accuracy (and thus that selectively applying HAVAE to them reduces hallucinations) rests on correlation. No ablation is described comparing HAVAE on NHAR heads versus random heads, low-NHAR heads, or heads selected by alternative criteria (e.g., visual grounding strength). Without this, it is unclear if the gains are specific to the HABI/NHAR procedure or would arise from generic attention re-weighting.

    Authors: This is a valid point on the need for stronger causal evidence. The original submission focused on NHAR-selected heads but did not include explicit ablations versus random or alternative selections. In the revision we have added Section 4.3 with these ablations: HAVAE applied to NHAR heads outperforms random-head selection by ~12-15% on hallucination reduction and also outperforms low-NHAR heads and heads chosen by visual-grounding strength. These results are now reported with quantitative tables, supporting that the gains are tied to the resilient heads identified by our procedure rather than generic re-weighting. revision: yes

  3. Referee: Method and overhead claim: The assertion of 'no additional computational overhead' for HAVAE does not quantify the one-time cost of computing NHAR via HABI (e.g., time for logit-lens projections over inert tokens across layers). If this pre-computation is non-negligible relative to inference, the practical advantage requires explicit measurement and amortization analysis.

    Authors: We appreciate the call for explicit quantification. The pre-computation of HABI/NHAR is a one-time offline step per model. In the revised manuscript (Section 3.4 and Appendix C) we now report wall-clock timings: the full HABI process requires approximately 2-4 minutes on a single A100 GPU for LLaVA-1.5-scale models. We also provide an amortization analysis showing this cost is negligible when spread over even a few hundred inferences, confirming that HAVAE itself imposes zero additional overhead at inference time. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical identification and intervention validated externally

Full rationale

The paper defines Vocabulary Hijacking, Inert Tokens, Hijacking Anchors, HABI, NHAR, and HAVAE through attention analysis and logit-lens projections on observed model behavior. NHAR is introduced as a metric to select heads, and HAVAE applies a training-free boost to those heads. No equations or steps are shown that reduce NHAR, HABI, or the hallucination-mitigation claim to quantities fitted from the evaluation data itself, nor do any self-citations serve as the sole justification for uniqueness or the central premise. Claims rest on benchmark experiments rather than self-referential definitions or predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 6 invented entities

The central claims rest on the existence of vocabulary hijacking as a dominant cause of hallucinations and on the assumption that NHAR-selected heads are causally linked to factual output; these are introduced without external benchmarks or formal proofs in the abstract.

invented entities (6)
  • Vocabulary Hijacking (no independent evidence)
    purpose: Describes the observed collapse of inert token representations to fixed unrelated words
    Newly named phenomenon discovered via logit lens
  • Inert Tokens (no independent evidence)
    purpose: Visual tokens that disproportionately attract attention and cause hijacking
    Defined from attention analysis
  • Hijacking Anchors (no independent evidence)
    purpose: Fixed set of unrelated words decoded from inert tokens across layers
    Observed projection result
  • HABI (no independent evidence)
    purpose: Strategy to localize inert tokens using semantic rigidity
    Proposed identification method
  • NHAR (no independent evidence)
    purpose: Metric to quantify resilient attention heads
    Newly introduced evaluation metric
  • HAVAE (no independent evidence)
    purpose: Training-free enhancement of critical heads to reduce hallucinations
    Proposed intervention technique

pith-pipeline@v0.9.0 · 5578 in / 1342 out tokens · 39614 ms · 2026-05-12T05:10:54.305354+00:00 · methodology

