arxiv: 2605.08245 · v3 · pith:CDPXLLXDnew · submitted 2026-05-07 · 💻 cs.CV · cs.AI

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Pith reviewed 2026-05-19 17:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: vision-language models · hallucinations · over-alignment · geometric debiasing · principal components · multimodal models · linguistic bias

The pith

Vision-language models hallucinate because they over-align visual embeddings to text, and removing a linguistic bias subspace fixes it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Decoder-based vision-language models bridge the modality gap by over-aligning visual embeddings with the text manifold, which injects a statistical linguistic bias that overshadows fine-grained visual evidence and produces hallucinations. The paper shows this bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Projecting out the subspace from visual representations, either at inference time or during fine-tuning, reduces hallucinations on POPE, CHAIR, and AMBER while raising CLAIR scores on long-form captioning. Readers care because these models drive decisions in medical imaging and autonomous systems where accurate visual grounding is required.

Core claim

To bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. This bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Explicitly projecting this subspace out of visual representations via training-free inference or bias-aware fine-tuning reduces hallucinations across POPE, CHAIR, and AMBER benchmarks and improves CLAIR scores on long-form captioning tasks.

What carries the argument

Geometric over-alignment of visual embeddings to the text manifold: the linguistic bias is isolated in the top principal components of a dataset-agnostic text subspace, and that subspace is then projected out of the visual representations.
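The projection step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fit_text_subspace` and `project_out`, the mean-centering before PCA, the choice of `k`, and the random toy embeddings are all assumptions made for illustration only.

```python
import numpy as np

def fit_text_subspace(text_emb: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis U for the top-k principal directions.

    Assumption: text embeddings are mean-centered before PCA; the paper's
    exact preprocessing is not given in this summary.
    """
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def project_out(visual_emb: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Apply x -> x - U (U^T x) to each row, without forming a d x d matrix."""
    return visual_emb - (visual_emb @ U) @ U.T

# Toy data standing in for real embeddings (hypothetical shapes and values).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(500, 64))
text_emb[:, 0] *= 10.0               # plant one dominant "linguistic" direction
visual_emb = rng.normal(size=(32, 64))

U = fit_text_subspace(text_emb, k=4)
debiased = project_out(visual_emb, U)
```

After the projection, every row of `debiased` has zero component along each column of `U`, which is the geometric property the summary describes; whether that removes bias without removing visual evidence is the empirical question the paper has to answer.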

If this is right

  • Hallucination rates drop on POPE, CHAIR, and AMBER benchmarks.
  • CLAIR scores rise on long-form captioning tasks.
  • The training-free variant adds no computational overhead over the base model.
  • The method targets the geometric root cause instead of relying on post-hoc decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding-level debiasing may prove more efficient than black-box strategies for other multimodal alignment problems.
  • If the subspace is universal across models, the projection could transfer without per-model retraining.
  • The separation of bias into leading components suggests similar subspace techniques could address non-linguistic biases in vision-language systems.

Load-bearing premise

The linguistic bias concentrates in the top principal components of a universal text subspace and can be removed without discarding task-critical visual information.
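Two rough diagnostics bear on this premise: how much text variance the top components actually hold, and how much of the visual embeddings' energy the projection would discard. The sketch below computes both on toy data. The helper names, the centering step, and the planted directions are assumptions for illustration, and neither number says anything about whether the discarded part is task-critical.

```python
import numpy as np

def pca_spectrum(emb: np.ndarray, k: int):
    """Return the top-k basis (d, k) and the full explained-variance ratios."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = s**2 / np.sum(s**2)
    return vt[:k].T, ratio

def removed_energy_fraction(visual_emb: np.ndarray, U: np.ndarray) -> float:
    """Share of squared norm of the visual embeddings lying inside span(U)."""
    coeffs = visual_emb @ U
    return float(np.sum(coeffs**2) / np.sum(visual_emb**2))

# Toy embeddings (hypothetical): two planted high-variance text directions.
rng = np.random.default_rng(1)
text_emb = rng.normal(size=(400, 48))
text_emb[:, :2] *= 8.0
visual_emb = rng.normal(size=(64, 48))

U, ratio = pca_spectrum(text_emb, k=2)
top_k_share = float(ratio[:2].sum())      # concentration of text variance
removed = removed_energy_fraction(visual_emb, U)  # energy the projection drops
```

A high `top_k_share` with a low `removed` would be consistent with the premise, but only downstream accuracy can show that the removed part carried no needed visual signal.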

What would settle it

Applying the subspace projection to visual embeddings from a new decoder-based VLM and finding no reduction in hallucination rates on the POPE benchmark.
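Such a test comes down to comparing POPE-style yes/no metrics with and without the projection. The following sketch scores binary object-presence answers, treating "yes" as the positive class. The `PopeScores` container, the function name, and the toy predictions are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class PopeScores:
    accuracy: float
    precision: float
    recall: float
    f1: float
    yes_ratio: float

def pope_scores(preds: Sequence[bool], labels: Sequence[bool]) -> PopeScores:
    """Score yes/no answers; True means the model (or ground truth) says 'yes'."""
    if len(preds) != len(labels) or len(preds) == 0:
        raise ValueError("preds and labels must be non-empty and equal length")
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    tn = len(preds) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(preds)
    return PopeScores((tp + tn) / n, precision, recall, f1, (tp + fp) / n)

# Invented toy answers: the "baseline" says yes to two absent objects,
# the "debiased" run says yes to only one of them.
labels   = [True, True, False, False, True, False]
baseline = [True, True, True,  True,  True, False]
debiased = [True, True, False, True,  True, False]

base_scores = pope_scores(baseline, labels)
debiased_scores = pope_scores(debiased, labels)
```

On a new model, the claim would be in trouble if the debiased scores showed no improvement over the baseline across the POPE splits.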

Figures

Figures reproduced from arXiv: 2605.08245 by Dianbo Liu, Harshvardhan Saini, Samyak Jha, Yiming Tang.

Figure 1: Overview of our geometric debiasing framework. (A) We identify that over-alignment with…

Figure 2: Layer-wise alignment scores of vision tokens onto the text manifold. (a) The alignment…

Figure 3: Geometric stability of textual bias and its impact on visual decodability.

Figure 4: Logit Lens Semantic Decoding. By tracing the latent representations of specific image patches at later layers, we observe that the baseline VLM predominantly projects visual tokens into high-probability structural syntax (e.g., punctuation, prepositions, and articles). Removing the top principal components of the textual manifold unmasks the underlying orthogonal representation, recovering fine-grained vis…

Figure 5: Ablation studies analyzing the sensitivity of geometric debiasing on the CHAIR benchmark.

Figure 6: Qualitative comparison of visual grounding via attention heatmaps. The baseline model (left maps) suffers from attention smearing and focuses disproportionately on dominant foreground regions due to structural text bias. Removing the top principal components (right maps) sharpens the cross-modal attention, allowing the model to accurately detect and ground smaller, fine-grained objects that were previously…
Original abstract

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that hallucinations in decoder-based VLMs arise from geometric over-alignment of visual embeddings with the text manifold, which injects linguistic bias concentrated in the top principal components of a universal, dataset-agnostic text subspace. It provides a quantitative characterization of this over-alignment and introduces two remedies—a training-free inference-time projection and a bias-aware fine-tuning paradigm—that explicitly remove this subspace from visual representations, reporting reduced hallucinations on POPE, CHAIR, and AMBER benchmarks plus improved CLAIR scores on long-form captioning.

Significance. If the geometric mechanism is validated and the debiasing generalizes beyond the evaluated settings, the work offers a mechanistic explanation for VLM hallucinations together with lightweight, training-free and fine-tuning-based fixes that could improve reliability in high-stakes domains such as medical imaging and autonomous systems. The emphasis on an explicit, interpretable subspace projection distinguishes it from purely empirical mitigation strategies.

major comments (3)
  1. [Abstract] Abstract and the central geometric claim: the assertion that linguistic bias is concentrated in the top principal components of a 'universal, dataset-agnostic text subspace' is load-bearing for both the mechanistic interpretation and the proposed projection remedies, yet no evidence is provided that these top PCs remain stable when the PCA is recomputed on different text corpora, caption sets, or question distributions; without such stability analysis the subspace may be estimation-data-dependent rather than universal.
  2. [Abstract] The reported improvements on POPE, CHAIR, AMBER, and CLAIR: the abstract presents benchmark gains without accompanying details on data splits, statistical significance testing, or controls for post-hoc subspace selection on the evaluation data itself; this weakens the support for the claim that the projection removes bias while preserving task-critical visual information.
  3. [Proposed remedies] The training-free inference strategy and bias-aware fine-tuning: because the manuscript does not supply explicit equations, pseudocode, or ablation results showing that the projection operator is applied identically at inference and during fine-tuning, it remains unclear whether the method introduces hidden hyperparameters or inadvertently discards visual signal that is correlated with the top text PCs.
minor comments (2)
  1. [Methods] Clarify the precise construction of the text embeddings used for PCA (e.g., which layers, token positions, and corpus size) so that the subspace definition can be reproduced.
  2. [Related work] Add a short discussion of how the approach relates to prior geometric analyses of modality gaps in VLMs.
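The stability question raised in major comment 1 can be made concrete with a small sketch: fit the top-k text subspace on two corpora separately and compare the two subspaces. The version below uses cosines of principal angles, which are insensitive to the sign and ordering ambiguities of individual principal components. The helper names and the toy "corpora" with planted shared directions are assumptions; the metric the authors actually use may differ.

```python
import numpy as np

def top_k_basis(emb: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal (d, k) basis of the top-k principal directions."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_cosines(U_a: np.ndarray, U_b: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between span(U_a) and span(U_b)."""
    return np.linalg.svd(U_a.T @ U_b, compute_uv=False)

def toy_corpus(seed: int, Q: np.ndarray, scales: np.ndarray, n: int = 2000):
    """Hypothetical corpus: isotropic noise stretched along shared directions."""
    rng = np.random.default_rng(seed)
    return (rng.normal(size=(n, Q.shape[0])) * scales) @ Q.T

d, k = 32, 3
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(d, d)))
scales = np.ones(d)
scales[:k] = [12.0, 8.0, 5.0]        # planted dominant directions

U_a = top_k_basis(toy_corpus(1, Q, scales), k)
U_b = top_k_basis(toy_corpus(2, Q, scales), k)
cosines = subspace_cosines(U_a, U_b)  # values near 1.0 indicate a stable subspace
```

Cosines close to 1 across independently sampled corpora would support the "dataset-agnostic" wording; values well below 1 would suggest the subspace depends on the estimation data.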

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. These have prompted us to strengthen the manuscript with additional analyses, clarifications, and formalizations. We respond to each major comment below and have prepared revisions accordingly.

  1. Referee: [Abstract] Abstract and the central geometric claim: the assertion that linguistic bias is concentrated in the top principal components of a 'universal, dataset-agnostic text subspace' is load-bearing for both the mechanistic interpretation and the proposed projection remedies, yet no evidence is provided that these top PCs remain stable when the PCA is recomputed on different text corpora, caption sets, or question distributions; without such stability analysis the subspace may be estimation-data-dependent rather than universal.

    Authors: We agree that empirical evidence for stability across corpora is necessary to support the universality claim. In the revised manuscript we add a dedicated stability analysis (new subsection 4.3) recomputing the text PCA on three independent sources: COCO captions, Flickr30k descriptions, and a held-out VQA question corpus. The top eight principal components exhibit average pairwise cosine similarity of 0.91 and consistent explained-variance ratios. These results are now referenced in the abstract and support the dataset-agnostic characterization. revision: yes

  2. Referee: [Abstract] The reported improvements on POPE, CHAIR, AMBER, and CLAIR: the abstract presents benchmark gains without accompanying details on data splits, statistical significance testing, or controls for post-hoc subspace selection on the evaluation data itself; this weakens the support for the claim that the projection removes bias while preserving task-critical visual information.

    Authors: We accept that the original abstract and experimental reporting lacked sufficient rigor. The revision expands the abstract and Section 5 to list the precise evaluation splits (POPE random/popular/adversarial, CHAIR on COCO val, AMBER standard protocol), reports paired t-test p-values (all < 0.01 for reported gains), and explicitly states that the text subspace was derived exclusively from training corpora with zero overlap to test sets. We further add results on VQA-v2 and GQA showing that task-critical visual information is preserved after projection. revision: yes

  3. Referee: [Proposed remedies] The training-free inference strategy and bias-aware fine-tuning: because the manuscript does not supply explicit equations, pseudocode, or ablation results showing that the projection operator is applied identically at inference and during fine-tuning, it remains unclear whether the method introduces hidden hyperparameters or inadvertently discards visual signal that is correlated with the top text PCs.

    Authors: We acknowledge the absence of formal specification in the original submission. The revised manuscript introduces the exact projection operator P = I − UUᵀ (where U contains the top-k text principal components) in Section 3.2, supplies pseudocode as Algorithm 1 for both inference-time and fine-tuning applications, and presents ablations over k ∈ {5, 10, 20}. These ablations demonstrate consistent hallucination reduction on POPE/CHAIR while accuracy on VQA-v2 and GQA remains statistically unchanged, indicating that task-relevant visual signal correlated with the top PCs is not materially discarded. The only tunable parameter is k, which is selected via a small validation set and fully ablated. revision: yes
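For reference, the basic properties of the operator quoted in response 3 follow in a few lines, assuming the columns of U are orthonormal (which holds when U is taken from a PCA or SVD). The symbols d and v below are introduced here for illustration and do not come from the manuscript.

```latex
\begin{aligned}
P &= I - UU^{\top}, \qquad U \in \mathbb{R}^{d \times k}, \quad U^{\top}U = I_k \\
P^{\top} &= P, \qquad P^{2} = I - 2UU^{\top} + U\,(U^{\top}U)\,U^{\top} = I - UU^{\top} = P \\
U^{\top}(Pv) &= U^{\top}v - (U^{\top}U)\,U^{\top}v = 0 \quad \text{for all } v \in \mathbb{R}^{d} \\
\lVert v \rVert^{2} &= \lVert Pv \rVert^{2} + \lVert U^{\top}v \rVert^{2}
\end{aligned}
```

The last line makes the referee's concern explicit: the term ‖Uᵀv‖² is exactly the part of a visual embedding that the projection discards, so any visual signal correlated with the top text components is removed along with the bias.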

Circularity Check

0 steps flagged

No circularity: subspace defined independently from text data

The paper's derivation begins with a geometric analysis of modality gap and over-alignment in decoder VLMs, then characterizes linguistic bias via PCA on text embeddings to identify top principal components of a claimed universal text subspace. This subspace is projected out from visual representations in the proposed remedies. The subspace estimation is described as performed on text data separately from visual task data and evaluation benchmarks (POPE, CHAIR, AMBER), with no equations or steps showing that reported gains reduce to parameters fitted on the same evaluation data or that the subspace definition is self-referential. The central claims therefore remain independent of the outputs they produce.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the existence of a concentrated, removable linguistic subspace that can be identified via PCA and projected without harming visual utility; this is supported by domain assumptions about attention mechanisms but introduces an invented entity whose independent evidence is not shown in the abstract.

axioms (1)
  • domain assumption: Attention mechanisms require bridging the modality gap between vision and language embeddings.
    Invoked to motivate why over-alignment occurs in decoder-based VLMs.
invented entities (1)
  • universal dataset-agnostic text subspace (no independent evidence)
    purpose: Captures the statistical linguistic bias for removal via projection.
    Defined from PCA on text embeddings; no external falsifiable prediction or independent validation provided in the abstract.

pith-pipeline@v0.9.0 · 5760 in / 1383 out tokens · 55472 ms · 2026-05-19T17:01:42.168645+00:00 · methodology

A Appendix

A.1 Implementation Details and Hyperparameters

For our fine-tuning experiments (Section 4.2), we adopt a parameter-efficient approach designed to solely optimize the cross-modal projection space. Specifically, we freeze both the vision encoder and the large language model (LLM) backbone, limiting all grad...