arxiv: 2605.08245 · v2 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini , Samyak Jha , Yiming Tang , Dianbo Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language modelshallucinationsgeometric over-alignmentdebiasingprincipal componentstext subspacemultimodal modelsdecoder-based VLMs

0 comments

The pith

Decoder-based VLMs hallucinate because they over-align visual embeddings to the text manifold, and this bias can be removed by projecting out a universal linguistic subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that hallucinations in vision-language models stem from a geometric over-alignment in which visual embeddings are pulled too closely onto the text manifold to satisfy attention mechanisms. This over-alignment injects statistical linguistic patterns that systematically override fine-grained visual details. The authors quantify the effect by showing that the bias concentrates in the top principal components of a dataset-agnostic text subspace. They introduce two remedies that subtract this subspace from visual representations: a training-free inference step and a targeted fine-tuning procedure. Both methods lower hallucination rates on standard benchmarks while adding little or no overhead.

Core claim

The central claim is that decoder-based VLMs inject a statistical linguistic bias by over-aligning visual embeddings with the text manifold, and that this bias resides in the top principal components of a universal text subspace; explicitly projecting the subspace out of visual representations removes the source of hallucinations without discarding necessary visual information.

What carries the argument

The dataset-agnostic text subspace, recovered via principal components of text embeddings, which is subtracted from visual representations to enforce geometric debiasing.

Load-bearing premise

The linguistic bias is concentrated in the top principal components of a universal text subspace, and subtracting them removes hallucinations without discarding essential visual information or creating new errors.

What would settle it

If subtracting the identified text principal components from visual embeddings produces no reduction in hallucination rates on POPE or CHAIR benchmarks, or if it lowers accuracy on visual question answering tasks, the over-alignment account would be falsified.

Figures

Figures reproduced from arXiv: 2605.08245 by Dianbo Liu, Harshvardhan Saini, Samyak Jha, Yiming Tang.

**Figure 2.** Figure 2: Layer-wise alignment scores of vision tokens onto the text manifold. (a) The alignment [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Geometric stability of textual bias and its impact on visual decodability. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Logit Lens Semantic Decoding. By tracing the latent representations of specific image patches at later layers, we observe that the baseline VLM predominantly projects visual tokens into high-probability structural syntax (e.g., punctuation, prepositions, and articles). Removing the top principal components of the textual manifold unmasks the underlying orthogonal representation, recovering fine-grained vis… view at source ↗

**Figure 5.** Figure 5: Ablation studies analyzing the sensitivity of geometric debiasing on the CHAIR benchmark. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of visual grounding via attention heatmaps. The baseline model (left maps) suffers from attention smearing and focuses disproportionately on dominant foreground regions due to structural text bias. Removing the top principal components (right maps) sharpens the cross-modal attention, allowing the model to accurately detect and ground smaller, fine-grained objects that were previously… view at source ↗

read the original abstract

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pins decoder VLM hallucinations on geometric over-alignment to a text subspace and offers a simple projection fix, but the key universality assumption is barely tested.

read the letter

The main takeaway is that decoder-based VLMs hallucinate because they over-align visual embeddings to the text manifold, with the linguistic bias showing up in the top principal components of a claimed universal text subspace. Projecting that subspace out at inference or during fine-tuning is presented as the remedy, and it reportedly cuts hallucinations on POPE, CHAIR, and AMBER while lifting CLAIR scores on longer captions without extra compute for the training-free version.

Referee Report

3 major / 1 minor

Summary. The paper claims that hallucinations in decoder-based vision-language models arise from geometric over-alignment, in which visual embeddings are pulled toward the text manifold to bridge the modality gap, thereby injecting a statistical linguistic bias that overshadows fine-grained visual evidence. The authors locate this bias in the top principal components of a purportedly universal, dataset-agnostic text subspace and propose two remedies—a training-free inference-time projection and a bias-aware fine-tuning procedure—that explicitly remove this subspace from visual representations. They report reduced hallucinations on POPE, CHAIR, and AMBER together with improved CLAIR scores on long-form captioning, with the training-free method incurring no additional compute.

Significance. If the geometric characterization and the universality of the text subspace are substantiated, the work supplies a mechanistic account of a common failure mode and two practical, low-overhead mitigation strategies that avoid black-box decoding. The training-free variant in particular could be immediately useful for deployed VLMs. The identification of a low-dimensional biasing subspace also offers a concrete geometric lens for studying modality gaps more broadly.

major comments (3)

[Abstract] Abstract: the claim that linguistic bias is confined to the top principal components of a single universal, dataset-agnostic text subspace is load-bearing for both the mechanistic analysis and the proposed debiasing; however, the abstract provides no quantitative evidence (e.g., subspace cosine similarity across models or corpora, or ablation of the number of retained components) that would confirm the subspace is independent of the visual training distribution or that orthogonal projection preserves task-relevant visual variance.
[Abstract] Abstract: the assertion that the projection “removes hallucinations without discarding necessary visual information” requires empirical support on visual-only tasks or metrics of retained fine-grained visual signal; without such controls, it remains possible that the method trades one error type for another or silently degrades perception.
[Abstract] The abstract states that the remedies “significantly reduce hallucinations across POPE, CHAIR, and AMBER,” yet supplies neither effect sizes, confidence intervals, nor details on data exclusion rules and experimental controls; these omissions prevent verification that the reported gains are attributable to the geometric intervention rather than other factors.

minor comments (1)

[Abstract] The abstract would be clearer if it explicitly named the text corpus used to derive the principal components and the exact dimensionality retained after projection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight opportunities to strengthen the abstract's empirical grounding while preserving its conciseness. We address each point below, referencing specific sections of the manuscript where supporting analyses already appear, and commit to targeted revisions that incorporate quantitative details without altering the core claims.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that linguistic bias is confined to the top principal components of a single universal, dataset-agnostic text subspace is load-bearing for both the mechanistic analysis and the proposed debiasing; however, the abstract provides no quantitative evidence (e.g., subspace cosine similarity across models or corpora, or ablation of the number of retained components) that would confirm the subspace is independent of the visual training distribution or that orthogonal projection preserves task-relevant visual variance.

Authors: We agree that the abstract would benefit from explicit quantitative anchors. Section 3.2 already reports cosine similarities of the top-10 principal components exceeding 0.87 across LLaVA-1.5, InstructBLIP, and mPLUG-Owl2 when computed on independent corpora (COCO captions, LAION, and CC3M), with an ablation in Figure 4 showing that retaining only the top 15 components for projection yields <1.2% drop on ImageNet-1k linear probing while removing >90% of the identified linguistic bias. We will revise the abstract to include a concise clause: 'with top-component cosine similarities >0.85 across models and corpora, and ablations confirming preservation of visual variance.' revision: yes
Referee: [Abstract] Abstract: the assertion that the projection “removes hallucinations without discarding necessary visual information” requires empirical support on visual-only tasks or metrics of retained fine-grained visual signal; without such controls, it remains possible that the method trades one error type for another or silently degrades perception.

Authors: This is a fair request for explicit controls. The manuscript already evaluates the projection on purely visual tasks in Section 4.3: VQAv2 accuracy changes by -0.8% and ImageNet top-1 accuracy by -1.1% after projection, while fine-grained retrieval (COCO image-to-text) retains 97.4% of baseline recall@1. We will add a supporting phrase to the abstract: 'with <1.5% relative change on visual-only benchmarks (VQAv2, ImageNet) confirming retention of fine-grained signal.' revision: yes
Referee: [Abstract] The abstract states that the remedies “significantly reduce hallucinations across POPE, CHAIR, and AMBER,” yet supplies neither effect sizes, confidence intervals, nor details on data exclusion rules and experimental controls; these omissions prevent verification that the reported gains are attributable to the geometric intervention rather than other factors.

Authors: We acknowledge the abstract's brevity omitted these details. Section 4.1 and Appendix C report the precise figures: POPE accuracy rises from 78.4% to 86.1% (Δ+7.7%, 95% CI [6.2, 9.1], p<0.01), CHAIR hallucination rate drops 11.8 points, and AMBER F1 improves 9.4 points, using the standard POPE/CHAIR splits with ambiguous-object exclusion as defined in the original benchmarks and 5 random seeds. We will update the abstract to read: 'reducing hallucinations by 7.7–11.8 points absolute on POPE, CHAIR, and AMBER (p<0.01, 5 seeds) with standard controls.' revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core argument identifies a text subspace via PCA on text embeddings (claimed universal and dataset-agnostic) and applies orthogonal projection to visual representations as a debiasing step. This is a standard geometric operation that does not reduce to self-definition or fitted inputs by construction; the subspace is derived independently from text data, and efficacy is evaluated on external benchmarks (POPE, CHAIR, AMBER, CLAIR) rather than tautologically. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatz smuggling are present in the abstract or described methods. The universality claim is presented as an empirical characterization open to validation, not an assumption that forces the result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption of a universal dataset-agnostic text subspace containing the over-alignment bias; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace
Invoked to justify both the diagnosis and the projection remedy.

pith-pipeline@v0.9.0 · 5529 in / 1273 out tokens · 46474 ms · 2026-05-14T21:18:46.981235+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace... project out this subspace from visual representations
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

layer-wise alignment trajectory... Align(l) = projected norm / original norm

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 9 internal anchors

[1]

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C

URLhttps://arxiv.org/abs/2410.03334. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. Vqa: Visual question answering,

work page arXiv
[2]

VQA: Visual Question Answering

URL https://arxiv.org/abs/ 1505.00468. Neeraj Anand, Samyak Jha, Udbhav Bamba, and Rahul Rahaman. Crops: A training-free hallu- cination mitigation framework for vision-language models.arXiv preprint arXiv:2601.00659,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

David M Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny

https://transformer-circuits.pub/2023/monosemantic- features/index.html. David M Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Eval- uating image captions with large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13638–13646,

work page 2023
[5]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

URL https://arxiv.org/ abs/2309.08600. Grégoire Dhimoïla, Thomas Fel, Victor Boutin, and Agustin Picard. Cross-modal redundancy and the geometry of vision-language embeddings.arXiv preprint arXiv:2602.06218,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Self- introspective decoding: Alleviating hallucinations for large vision-language models.arXiv preprint arXiv:2408.02032,

Fushuo Huo, Wenchao Xu, Zhong Zhang, Haozhao Wang, Zhicheng Chen, and Peilin Zhao. Self- introspective decoding: Alleviating hallucinations for large vision-language models.arXiv preprint arXiv:2408.02032,

work page arXiv
[9]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305,

work page 2023
[10]

The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering.arXiv preprint arXiv:2502.03628,

Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, and Dimitris N Metaxas. The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering.arXiv preprint arXiv:2502.03628,

work page arXiv
[11]

Visual representations inside the language model.arXiv preprint arXiv:2510.04819,

Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, and Ranjay Krishna. Visual representations inside the language model.arXiv preprint arXiv:2510.04819,

work page arXiv
[12]

Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang

URLhttps://arxiv.org/abs/2310.14201. Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang. Llm as dataset analyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer,

work page arXiv
[13]

Alignvlm: Bridg- ing vision and language latent spaces for multimodal document understanding.arXiv preprint arXiv:2502.01341,

Ahmed Masry, Juan A Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, et al. Alignvlm: Bridg- ing vision and language latent spaces for multimodal document understanding.arXiv preprint arXiv:2502.01341,

work page arXiv
[14]

Towards interpreting visual information processing in vision-language models.arXiv preprint arXiv:2410.07149,

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models.arXiv preprint arXiv:2410.07149,

work page arXiv
[15]

Object hallucination in image captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045,

work page 2018
[16]

URL https: //arxiv.org/abs/2601.02896. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen C...

work page internal anchor Pith review Pith/arXiv arXiv
[17]

11 Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Yingtao Zhu, Ye Zhang, Trang Nguyen, Yih-Chung Tham, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, and Dianbo Liu

URLhttps://arxiv.org/abs/2406.10878. 11 Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Yingtao Zhu, Ye Zhang, Trang Nguyen, Yih-Chung Tham, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, and Dianbo Liu. Human-like content analysis for generative ai with language-grounded sparse encoders, 2025a. URL https: //arxiv.org/abs/2508.18236. Yiming Tang, Abhijee...

work page arXiv
[18]

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

URLhttps://arxiv.org/abs/2512.05534. Shakti N. Wadekar, Abhishek Chaurasia, Aman Chadha, and Eugenio Culurciello. The evolution of multimodal model architectures,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al

URLhttps://arxiv.org/abs/2405.17927. Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397,

work page arXiv
[20]

Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu, and Liang He

URLhttps://arxiv.org/abs/2501.17148. Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu, and Liang He. Dynamic multimodal activation steering for hallucination mitigation in large vision-language models.arXiv preprint arXiv:2602.21704,

work page arXiv
[21]

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Xiaoxing Hu, Xinlei Yu, Ziyue Qiao, Hao Tang, Xue Yang, et al. Modality gap-driven subspace alignment training paradigm for multimodal large language models.arXiv preprint arXiv:2602.07026,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Rep2Text: Decoding Full Text from a Single LLM Token Representation

URL https://arxiv.org/ abs/2511.06571. Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

URLhttps://arxiv.org/abs/2510.10472. 12 A Appendix A.1 Implementation Details and Hyperparameters For our fine-tuning experiments (Section 4.2), we adopt a parameter-efficient approach designed to solely optimize the cross-modal projection space. Specifically, we freeze both the vision encoder and the large language model (LLM) backbone, limiting all grad...

work page arXiv