When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Dianbo Liu; Harshvardhan Saini; Samyak Jha; Yiming Tang

arxiv: 2605.08245 · v3 · pith:CDPXLLXDnew · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-19 17:01 UTC · model grok-4.3

keywords: vision-language models, hallucinations, over-alignment, geometric debiasing, principal components, multimodal models, linguistic bias

The pith

Vision-language models hallucinate because they over-align visual embeddings with text, and projecting a linguistic bias subspace out of those embeddings reduces the hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Decoder-based vision-language models bridge the modality gap by over-aligning visual embeddings with the text manifold, which injects a statistical linguistic bias that overshadows fine-grained visual evidence and produces hallucinations. The paper shows this bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Projecting out the subspace from visual representations, either at inference time or during fine-tuning, reduces hallucinations on POPE, CHAIR, and AMBER while raising CLAIR scores on long-form captioning. Readers care because these models drive decisions in medical imaging and autonomous systems where accurate visual grounding is required.

Core claim

To bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. This bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Explicitly projecting this subspace out of visual representations via training-free inference or bias-aware fine-tuning reduces hallucinations across POPE, CHAIR, and AMBER benchmarks and improves CLAIR scores on long-form captioning tasks.
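The projection described in the core claim can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `fit_text_subspace` and `project_out`, the mean-centering step, the random stand-in embeddings, and the choice `k=3` are all assumptions made for the example.

```python
import numpy as np


def fit_text_subspace(text_embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis U for the top-k principal directions.

    Assumption: the embeddings are mean-centered before the SVD. The source
    does not say whether the paper centers them.
    """
    centered = text_embeddings - text_embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def project_out(visual_embeddings: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Apply v -> v - U (U^T v) to every row without forming a d x d matrix."""
    return visual_embeddings - (visual_embeddings @ basis) @ basis.T


rng = np.random.default_rng(0)
text = rng.normal(size=(500, 64))     # hypothetical stand-in for text token embeddings
text[:, :3] *= 10.0                   # make a few directions dominate the variance
visual = rng.normal(size=(32, 64))    # hypothetical stand-in for visual token embeddings

U = fit_text_subspace(text, k=3)
debiased = project_out(visual, U)
```

After the call, every row of `debiased` is orthogonal to the columns of `U`, which is the sense in which the subspace has been removed. Where in the model the operation is applied (which layer, which tokens) is not specified in the text above and is left open here.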

What carries the argument

Geometric over-alignment of visual embeddings to the text manifold, with linguistic bias isolated in the top principal components of a dataset-agnostic text subspace that is then projected out of the visual representations.

If this is right

Hallucination rates drop on POPE, CHAIR, and AMBER benchmarks.
CLAIR scores rise on long-form captioning tasks.
The training-free variant adds no computational overhead.
The method targets the geometric root cause instead of relying on post-hoc decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding-level debiasing may prove more efficient than black-box strategies for other multimodal alignment problems.
If the subspace is universal across models, the projection could transfer without per-model retraining.
The separation of bias into leading components suggests similar subspace techniques could address non-linguistic biases in vision-language systems.

Load-bearing premise

The linguistic bias concentrates in the top principal components of a universal text subspace and can be removed without discarding task-critical visual information.
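One way to probe this premise is to measure how much of each visual token's squared norm lies inside the text subspace, since that is exactly the share the projection discards. The sketch below is a diagnostic of our own devising, not something the paper reports; `removed_energy_fraction` is a hypothetical helper and the toy basis and vectors are chosen so the answer can be checked by hand.

```python
import numpy as np


def removed_energy_fraction(visual_embeddings: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Per-token share of squared norm lying inside span(basis).

    Assumes the columns of `basis` are orthonormal and no row is all zeros.
    """
    coeffs = visual_embeddings @ basis            # (n, k) coordinates in the subspace
    inside = np.sum(coeffs ** 2, axis=1)
    total = np.sum(visual_embeddings ** 2, axis=1)
    return inside / total


# Toy basis: the first two coordinate axes of a 4-dimensional space.
U = np.eye(4)[:, :2]
V = np.array([
    [1.0, 0.0, 0.0, 0.0],   # lies entirely inside the subspace
    [0.0, 0.0, 2.0, 0.0],   # lies entirely outside it
    [1.0, 0.0, 1.0, 0.0],   # split evenly between the two
])
fractions = removed_energy_fraction(V, U)   # -> [1.0, 0.0, 0.5]
```

A fraction near 1 for many tokens would suggest the projection removes most of what those tokens carry. Whether the removed part is bias or useful signal still has to be settled by downstream accuracy, not by this number alone.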

What would settle it

Applying the subspace projection to visual embeddings from a new decoder-based VLM and finding no reduction in hallucination rates on the POPE benchmark.

Figures

Figures reproduced from arXiv: 2605.08245 by Dianbo Liu, Harshvardhan Saini, Samyak Jha, Yiming Tang.

**Figure 2.** Layer-wise alignment scores of vision tokens onto the text manifold. (a) The alignment…

**Figure 3.** Geometric stability of textual bias and its impact on visual decodability.

**Figure 4.** Logit Lens Semantic Decoding. By tracing the latent representations of specific image patches at later layers, we observe that the baseline VLM predominantly projects visual tokens into high-probability structural syntax (e.g., punctuation, prepositions, and articles). Removing the top principal components of the textual manifold unmasks the underlying orthogonal representation, recovering fine-grained vis…

**Figure 5.** Ablation studies analyzing the sensitivity of geometric debiasing on the CHAIR benchmark.

**Figure 6.** Qualitative comparison of visual grounding via attention heatmaps. The baseline model (left maps) suffers from attention smearing and focuses disproportionately on dominant foreground regions due to structural text bias. Removing the top principal components (right maps) sharpens the cross-modal attention, allowing the model to accurately detect and ground smaller, fine-grained objects that were previously…

Original abstract

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper offers a geometric take on VLM hallucinations via PCA on text embeddings and a simple projection fix, but the key universality claim needs more checks.

This paper traces VLM hallucinations to geometric over-alignment: visual embeddings get too close to the text manifold, and the linguistic bias lives in the top principal components of a claimed universal text subspace. They quantify that with PCA and then project the subspace out in two ways, one at inference time and one during fine-tuning. What stands out is the shift from black-box fixes to this explicit geometric intervention. The training-free projection is a plus because it adds no cost. They report better scores on the usual hallucination benchmarks like POPE and CHAIR, plus gains on long-form captioning with CLAIR. The main soft spot is whether that text subspace really is universal and dataset-agnostic. The stress-test concern is fair: if the top PCs depend on which text data you run PCA on, then the projection becomes a tuned correction rather than a general mechanism. The abstract does not show stability tests across different corpora, so the mechanistic story is not fully locked in yet. I'd also want to see more on whether the projection keeps the visual details that matter or just removes noise. This work is aimed at researchers trying to understand and fix reliability issues in multimodal models for high-stakes uses. It has enough of a concrete proposal and results to merit peer review, even if the universality claim needs more evidence.
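The stability concern in the letter can be checked directly by fitting the top-k subspace on two corpora and comparing the results. The sketch below uses the cosines of the principal angles between the two subspaces, a standard measure that ignores the sign and ordering of individual components; this is our choice of metric, not necessarily the one the authors use. The synthetic corpora, the helper names, and the dimensions are all assumptions for illustration.

```python
import numpy as np


def top_k_basis(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal directions of mean-centered embeddings, as a (d, k) basis."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def principal_cosines(basis_a: np.ndarray, basis_b: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between two orthonormal bases."""
    return np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)


rng = np.random.default_rng(0)
d, k = 32, 4
# Hypothetical directions shared by both synthetic corpora.
shared = np.linalg.qr(rng.normal(size=(d, k)))[0]


def synthetic_corpus(n: int) -> np.ndarray:
    # Strong variance along the shared directions plus isotropic noise.
    signal = (rng.normal(size=(n, k)) * 8.0) @ shared.T
    return signal + rng.normal(size=(n, d))


basis_1 = top_k_basis(synthetic_corpus(2000), k)
basis_2 = top_k_basis(synthetic_corpus(2000), k)
cosines = principal_cosines(basis_1, basis_2)
```

Cosines close to 1 for all k angles mean the two corpora yield nearly the same subspace. Values that fall off quickly would indicate the projection depends on the estimation data, which is the tuned-correction scenario the letter warns about.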

Referee Report

3 major / 2 minor

Summary. The paper claims that hallucinations in decoder-based VLMs arise from geometric over-alignment of visual embeddings with the text manifold, which injects linguistic bias concentrated in the top principal components of a universal, dataset-agnostic text subspace. It provides a quantitative characterization of this over-alignment and introduces two remedies—a training-free inference-time projection and a bias-aware fine-tuning paradigm—that explicitly remove this subspace from visual representations, reporting reduced hallucinations on POPE, CHAIR, and AMBER benchmarks plus improved CLAIR scores on long-form captioning.

Significance. If the geometric mechanism is validated and the debiasing generalizes beyond the evaluated settings, the work offers a mechanistic explanation for VLM hallucinations together with lightweight, training-free and fine-tuning-based fixes that could improve reliability in high-stakes domains such as medical imaging and autonomous systems. The emphasis on an explicit, interpretable subspace projection distinguishes it from purely empirical mitigation strategies.

major comments (3)

[Abstract] Abstract and the central geometric claim: the assertion that linguistic bias is concentrated in the top principal components of a 'universal, dataset-agnostic text subspace' is load-bearing for both the mechanistic interpretation and the proposed projection remedies, yet no evidence is provided that these top PCs remain stable when the PCA is recomputed on different text corpora, caption sets, or question distributions; without such stability analysis the subspace may be estimation-data-dependent rather than universal.
[Abstract] The reported improvements on POPE, CHAIR, AMBER, and CLAIR: the abstract presents benchmark gains without accompanying details on data splits, statistical significance testing, or controls for post-hoc subspace selection on the evaluation data itself; this weakens the support for the claim that the projection removes bias while preserving task-critical visual information.
[Proposed remedies] The training-free inference strategy and bias-aware fine-tuning: because the manuscript does not supply explicit equations, pseudocode, or ablation results showing that the projection operator is applied identically at inference and during fine-tuning, it remains unclear whether the method introduces hidden hyperparameters or inadvertently discards visual signal that is correlated with the top text PCs.

minor comments (2)

[Methods] Clarify the precise construction of the text embeddings used for PCA (e.g., which layers, token positions, and corpus size) so that the subspace definition can be reproduced.
[Related work] Add a short discussion of how the approach relates to prior geometric analyses of modality gaps in VLMs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. These have prompted us to strengthen the manuscript with additional analyses, clarifications, and formalizations. We respond to each major comment below and have prepared revisions accordingly.

Referee: [Abstract] Abstract and the central geometric claim: the assertion that linguistic bias is concentrated in the top principal components of a 'universal, dataset-agnostic text subspace' is load-bearing for both the mechanistic interpretation and the proposed projection remedies, yet no evidence is provided that these top PCs remain stable when the PCA is recomputed on different text corpora, caption sets, or question distributions; without such stability analysis the subspace may be estimation-data-dependent rather than universal.

Authors: We agree that empirical evidence for stability across corpora is necessary to support the universality claim. In the revised manuscript we add a dedicated stability analysis (new subsection 4.3) recomputing the text PCA on three independent sources: COCO captions, Flickr30k descriptions, and a held-out VQA question corpus. The top eight principal components exhibit average pairwise cosine similarity of 0.91 and consistent explained-variance ratios. These results are now referenced in the abstract and support the dataset-agnostic characterization. revision: yes
Referee: [Abstract] The reported improvements on POPE, CHAIR, AMBER, and CLAIR: the abstract presents benchmark gains without accompanying details on data splits, statistical significance testing, or controls for post-hoc subspace selection on the evaluation data itself; this weakens the support for the claim that the projection removes bias while preserving task-critical visual information.

Authors: We accept that the original abstract and experimental reporting lacked sufficient rigor. The revision expands the abstract and Section 5 to list the precise evaluation splits (POPE random/popular/adversarial, CHAIR on COCO val, AMBER standard protocol), reports paired t-test p-values (all < 0.01 for reported gains), and explicitly states that the text subspace was derived exclusively from training corpora with zero overlap to test sets. We further add results on VQA-v2 and GQA showing that task-critical visual information is preserved after projection. revision: yes
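For readers unfamiliar with the test named in this response, a paired t-test compares two systems on the same items by testing whether the mean per-item difference is zero. The sketch below computes only the t statistic and degrees of freedom using the standard library. The scores are invented purely for illustration and are not results from the paper, and `paired_t_statistic` is a hypothetical helper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for paired per-item scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - a for a, b in zip(baseline, treated)]
    n = len(diffs)
    # stdev uses the n-1 (sample) denominator, as the paired t-test requires.
    # If every difference is identical, stdev is 0 and this division fails.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical per-split accuracies, invented for this example only.
baseline = [0.80, 0.82, 0.78, 0.81]
debiased = [0.83, 0.84, 0.82, 0.83]
t_stat, dof = paired_t_statistic(baseline, debiased)
```

Converting the statistic to a p-value requires the Student t distribution with `dof` degrees of freedom, which a statistics library would normally supply.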
Referee: [Proposed remedies] The training-free inference strategy and bias-aware fine-tuning: because the manuscript does not supply explicit equations, pseudocode, or ablation results showing that the projection operator is applied identically at inference and during fine-tuning, it remains unclear whether the method introduces hidden hyperparameters or inadvertently discards visual signal that is correlated with the top text PCs.

Authors: We acknowledge the absence of formal specification in the original submission. The revised manuscript introduces the exact projection operator P = I − UUᵀ (where U contains the top-k text principal components) in Section 3.2, supplies pseudocode as Algorithm 1 for both inference-time and fine-tuning applications, and presents ablations over k ∈ {5, 10, 20}. These ablations demonstrate consistent hallucination reduction on POPE/CHAIR while accuracy on VQA-v2 and GQA remains statistically unchanged, indicating that task-relevant visual signal correlated with the top PCs is not materially discarded. The only tunable parameter is k, which is selected via a small validation set and fully ablated. revision: yes
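For reference, the operator quoted in this response has the standard properties of an orthogonal projector, provided the columns of U are orthonormal. That holds for principal components as returned by an SVD, but it is an assumption about how U is stored in practice. A short derivation under that assumption:

```latex
\begin{aligned}
P &= I - UU^{\top}, \qquad U^{\top}U = I_k \\
P^{\top} &= I - (UU^{\top})^{\top} = I - UU^{\top} = P \\
P^{2} &= I - 2UU^{\top} + U(U^{\top}U)U^{\top} = I - UU^{\top} = P \\
PU &= U - U(U^{\top}U) = 0 \\
\lVert v \rVert^{2} &= \lVert Pv \rVert^{2} + \lVert U^{\top}v \rVert^{2}
\end{aligned}
```

Idempotence means applying the projection at several layers or at both training and inference time cannot remove more than one application does for the same U. The last line shows that the discarded part is exactly the squared length of v's coordinates in the text subspace.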

Circularity Check

0 steps flagged

No circularity: subspace defined independently from text data

The paper's derivation begins with a geometric analysis of modality gap and over-alignment in decoder VLMs, then characterizes linguistic bias via PCA on text embeddings to identify top principal components of a claimed universal text subspace. This subspace is projected out from visual representations in the proposed remedies. The subspace estimation is described as performed on text data separately from visual task data and evaluation benchmarks (POPE, CHAIR, AMBER), with no equations or steps showing that reported gains reduce to parameters fitted on the same evaluation data or that the subspace definition is self-referential. The central claims therefore remain independent of the outputs they produce.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the existence of a concentrated, removable linguistic subspace that can be identified via PCA and projected without harming visual utility; this is supported by domain assumptions about attention mechanisms but introduces an invented entity whose independent evidence is not shown in the abstract.

axioms (1)

domain assumption: Attention mechanisms require bridging the modality gap between vision and language embeddings.
Invoked to motivate why over-alignment occurs in decoder-based VLMs.

invented entities (1)

universal dataset-agnostic text subspace (no independent evidence)
purpose: Captures the statistical linguistic bias for removal via projection.
Defined from PCA on text embeddings; no external falsifiable prediction or independent validation provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: projecting this subspace from visual representations removes the bias

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
