arxiv: 2604.02486 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.CL


VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Haz Sameen Shahgir , Xiaofu Chen , Yu Fu , Erfan Shayegani , Nael Abu-Ghazaleh , Yova Kementchedjhieva , Yue Dong


Pith reviewed 2026-05-13 21:53 UTC · model grok-4.3


keywords: vision-language models, fine-grained visual perception, semantic labels, hallucination, logit lens, nameable entities, multimodal reasoning


The pith

Vision-language models bypass visual detail by anchoring to semantic labels when available

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language models fail at fine-grained visual tasks because their language components lack labels for detailed visual features and default to semantic shortcuts instead. When entities match known concepts, models reason through language labels rather than comparing visual information directly. For entities without familiar names, they fall back on inaccurate or invented descriptions. Tests on semantic correspondence, shape matching, and face matching show markedly better results for nameable items, with mechanistic evidence from logit lens analysis confirming label recovery. Assigning arbitrary names or applying targeted finetuning can reduce the reliance on these shortcuts.

Core claim

VLMs perform much better when the relevant entities are nameable than when they are unnameable. Mechanistically, they explicitly recover semantic labels for nameable entities and surface more unique tokens for unnameable ones. This limitation arises because the language model lacks semantic labels for fine-grained visual details, so visual comparison is either bypassed or replaced by hallucinated descriptions. The issue can be addressed by teaching arbitrary names or through task-specific finetuning that promotes real visual perception.

What carries the argument

Semantic label recovery in the language model component, which enables reasoning through known concepts rather than direct visual feature comparison.
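
For readers unfamiliar with the Logit Lens technique that the paper uses as mechanistic evidence, the following is a minimal sketch on a toy model, not the authors' implementation. Each layer's hidden state is projected through the output (unembedding) matrix and the top-scoring token is read off. The vocabulary, the matrix values, and the function name `logit_lens` are all hypothetical. Real implementations typically also apply the model's final normalization layer before the projection, which is omitted here.

```python
# Toy Logit Lens: decode each layer's hidden state with the unembedding matrix.
# All values below are made up for illustration; none come from the paper.

VOCAB = ["star", "circle", "blob", "edge"]  # hypothetical vocabulary

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
UNEMBED = [
    [1.0, 0.0, 0.0],  # "star"
    [0.0, 1.0, 0.0],  # "circle"
    [0.0, 0.0, 1.0],  # "blob"
    [0.5, 0.5, 0.5],  # "edge"
]


def logit_lens(hidden_states, unembed=UNEMBED, vocab=VOCAB):
    """Return the top-1 decoded token for each layer's hidden state."""
    decoded = []
    for h in hidden_states:
        # One logit per vocabulary entry: dot product of its row with h.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Hidden state of one visual token at three successive layers (toy numbers).
layers = [
    [0.1, 0.2, 0.9],  # early layer
    [0.6, 0.6, 0.7],  # middle layer
    [2.0, 0.1, 0.1],  # late layer
]
print(logit_lens(layers))  # ['blob', 'edge', 'star']
```

In this toy run a concrete label ("star") only becomes the top token in the last layer; the paper's claim concerns whether such a label is recovered at all for nameable versus unnameable inputs.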

If this is right

VLMs bypass visual comparison for nameable entities and reason through language.
Performance drops for unnamable entities due to brittle and hallucinated descriptions.
Teaching arbitrary names for unknown entities improves performance on visual tasks.
Task-specific finetuning enables stronger generalization through actual visual perception instead of language priors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training approaches that delay or limit early semantic mapping could encourage more robust visual processing.
The shortcut pattern may appear in other multimodal tasks that require precise discrimination of visual features.
Additional tests that systematically vary visual complexity would help isolate the role of nameability.

Load-bearing premise

Performance differences between nameable and unnamable visual entities result from the presence or absence of semantic labels rather than differences in visual complexity or training data distribution.

What would settle it

A controlled experiment showing equivalent performance on nameable and unnamable entities after matching for visual complexity and data exposure would disprove that missing semantic labels drive the gap.
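
One simple way to approximate the matched comparison described above is to stratify trials by a visual-complexity bin and look at the nameable versus unnameable accuracy gap within each bin. The sketch below assumes a hypothetical record format `(complexity_bin, is_nameable, is_correct)` and invented data; how complexity would actually be measured is left open by the source.

```python
from collections import defaultdict


def gap_by_complexity(records):
    """Per complexity bin, return accuracy(nameable) - accuracy(unnameable).

    Each record is (complexity_bin, is_nameable, is_correct). Bins that lack
    either condition are skipped, since no matched comparison exists there.
    """
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for bin_id, nameable, correct in records:
        cell = counts[bin_id][nameable]
        cell[0] += int(correct)  # number of correct trials
        cell[1] += 1             # number of trials
    gaps = {}
    for bin_id, cells in sorted(counts.items()):
        (c_n, n_n), (c_u, n_u) = cells[True], cells[False]
        if n_n and n_u:
            gaps[bin_id] = c_n / n_n - c_u / n_u
    return gaps


# Hypothetical trial records, invented for illustration.
records = [
    ("low", True, True), ("low", True, True),
    ("low", False, True), ("low", False, False),
    ("high", True, True), ("high", True, False),
    ("high", False, False), ("high", False, False),
]
print(gap_by_complexity(records))  # {'high': 0.5, 'low': 0.5}
```

If the gap shrank toward zero in every bin once complexity and exposure were matched, that would count against the missing-label explanation; a gap that persists within bins would be consistent with it.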

Figures

Figures reproduced from arXiv: 2604.02486 by Erfan Shayegani, Haz Sameen Shahgir, Nael Abu-Ghazaleh, Xiaofu Chen, Yova Kementchedjhieva, Yue Dong, Yu Fu.

**Figure 2.** Example of our 2D shape correspondence task. Face correspondence example is…

**Figure 3.** Logit Lens analysis. Top row: layerwise decoded tokens for a known shape (star)…

**Figure 4.** Mean Jaccard Distance across layers for Qwen3VL-2B after learning arbitrary names. Finetuned models close the gap between the unknown and the known shapes baselines. … pre-finetuning Representation Probe of 74.2%. Human names reach 70.2% and random names 62.8%. Qwen3VL-2B benefits most from ordinary names (86.0%), while Gemma3-4B benefits most from random names (65.1%). The three name sets differ in average…

**Figure 5.** Set of procedurally-generated shapes tested.

**Figure 6.** Task-specific finetuning has lower Jaccard Distance than teaching ordinary names, despite higher VQA accuracy (98.7% vs 86.0%). … evaluate on held-out squiggles with both simpler and more complex configurations, procedurally generated mazes whose rectilinear grid structure bears no geometric resemblance to the shapes seen during training, and on semantic and face correspondence tasks that represent extrem…

**Figure 7.** Chain-of-Thought reasoning allows the VLM to explicitly recover the semantic…

**Figure 8.** Layer-wise Representation Probing Performance for Nameable and No-Name…

**Figure 9.** Layer-wise Representation Probing Performance for Known (Celebrity) and…

**Figure 10.** Layer-wise Representation Probing Performance for 2D Known and Procedurally…

**Figure 11.** Face correspondence task. Known celebrities (left) and AI-generated faces (right).

**Figure 12.** Qualitative Logit Lens example from Gemma-3-12B. The results show that some…

**Figure 13.** All Gemma3-12B Logit Lens tokens for (Unknown Shape 1)

**Figure 14.** All Gemma3-12B Logit Lens tokens for (Unknown Shape 2)

**Figure 15.** Representation Probing accuracy increases after teaching VLM names through…

**Figure 16.** Chain-of-thought example on unknown shapes. Without a semantic anchor, the…
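
Figures 4 and 6 report a mean Jaccard Distance across layers. The captions reproduced here do not say exactly which token sets are compared, so the sketch below makes an assumption: it compares the sets of decoded tokens for two inputs at each layer and averages the distances. The function names and the toy token lists are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_layerwise_distance(tokens_x, tokens_y):
    """Average Jaccard distance over layers, given per-layer token lists."""
    if len(tokens_x) != len(tokens_y):
        raise ValueError("both inputs need one token list per layer")
    dists = [jaccard_distance(x, y) for x, y in zip(tokens_x, tokens_y)]
    return sum(dists) / len(dists)


# Hypothetical per-layer decoded tokens for two inputs (toy data).
shape_a = [["line", "curve"], ["star", "point"], ["star"]]
shape_b = [["line", "dot"], ["star", "point"], ["blob"]]
print(round(mean_layerwise_distance(shape_a, shape_b), 4))  # 0.5556
```

A distance of 0 means the two inputs decode to the same token set at a layer, and 1 means the sets are disjoint.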

Original abstract

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they often fail on tasks that require fine-grained visual perception, even when the required information is still present in their internal representations. Prior work has attributed this "hidden-in-plain-sight" gap to the language model, but the cause remains unexplained. In this work, we demonstrate that this gap arises from the language model's lack of semantic labels for fine-grained visual details: when visual entities can be mapped to known concepts, VLMs bypass visual comparison and reason through language; when they cannot, VLMs resort to brittle and hallucinated descriptions. We verify this across semantic correspondence, synthetic shape matching, and face matching, and find that VLMs perform much better when the relevant entities are nameable than when they are unnamable. Mechanistically, Logit Lens analysis confirms that VLMs explicitly recover semantic labels for nameable entities and surface more unique tokens compared to unnameable entities. Furthermore, we show that this limitation can be addressed: teaching completely arbitrary names for unknown entities improves performance. More importantly, task-specific finetuning yields even stronger generalization without relying on language priors, i.e., through real visual perception. Our findings suggest that current VLM failures on visual tasks reflect a learned shortcut rather than a fundamental limitation of multimodal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper offers a plausible account of why VLMs struggle with fine visual details but the evidence for the nameability mechanism over other confounds is not yet conclusive.


The paper's core claim is that vision-language models skip over fine visual details and instead route through language semantics when the objects have recognizable names. They show this pattern in semantic correspondence, synthetic shape matching, and face matching tasks, where performance drops for unnamable entities. Logit Lens analysis supports that the model recovers semantic labels for the nameable cases and produces more varied tokens otherwise. They also demonstrate that assigning arbitrary names to unknown entities boosts performance, and task-specific fine-tuning does even better by encouraging actual visual processing.

This adds a mechanistic angle to earlier observations about VLM weaknesses on detailed vision. The arbitrary-name teaching result is a nice, direct test of the idea, and the fine-tuning finding points to a way forward that doesn't depend on language priors. The experiments line up with the story across three different setups, which is a plus. The interventions feel actionable for improving these models.

That said, the nameable versus unnamable distinction might carry other differences along with it. Visual complexity, feature statistics, or how common the items are in training data could explain the performance gap without needing the semantic label mechanism. The stress test raises this, and from what's described, it's not clear they fully matched the stimuli on those dimensions. If the paper has detailed controls or ablations showing the effect survives those, it would be stronger. Right now the causal isolation looks incomplete.

This kind of work is relevant for anyone trying to make VLMs more reliable on visual discrimination tasks. It has enough substance to warrant peer review so the community can examine the methods and results closely. I'd recommend sending it out rather than desk rejecting.

Referee Report

3 major / 2 minor

Summary. The paper claims that VLMs fail on fine-grained visual perception tasks because the language model lacks semantic labels for detailed visual features: nameable entities trigger language-based shortcuts that bypass visual comparison, while unnamable entities lead to brittle or hallucinated outputs. This is supported by performance gaps favoring nameable stimuli across semantic correspondence, synthetic shape matching, and face matching tasks; Logit Lens analysis showing explicit label recovery for nameable cases; and interventions where teaching arbitrary names improves results and task-specific finetuning yields stronger generalization via visual processing rather than language priors.

Significance. If the central claim is substantiated, the work offers a mechanistic account of a widespread VLM limitation, reframing it as a learned shortcut rather than an inherent multimodal deficit. The multi-task empirical comparisons, mechanistic evidence, and successful interventions (arbitrary-name teaching and finetuning) provide actionable insights for reducing reliance on semantic anchors, with potential to guide future training paradigms that emphasize genuine visual reasoning.

major comments (3)

[Section 4 (Experiments)] The experimental sections (semantic correspondence, synthetic shape matching, and face matching) do not report explicit matching or statistical controls for visual complexity, feature statistics, training-data frequency, or alignment with vision-encoder priors between nameable and unnamable stimuli. Without such controls, the observed performance gaps cannot be unambiguously attributed to the presence or absence of semantic labels rather than these confounds.
[Section 5 (Interventions)] The arbitrary-name teaching result and the finetuning result are presented as evidence that the limitation is addressable, yet the manuscript does not include ablation controls demonstrating that performance gains arise specifically from label acquisition rather than incidental changes in visual feature processing or data distribution during the intervention.
[Section 4.3 (Mechanistic Analysis)] Logit Lens analysis is used to confirm explicit recovery of semantic labels for nameable entities, but the manuscript does not quantify or compare the degree of visual-feature utilization (e.g., via attention maps or representation similarity) between nameable and unnamable conditions to rule out differential visual processing as an alternative explanation.

minor comments (2)

[Abstract and Section 4] The abstract states that VLMs 'resort to brittle and hallucinated descriptions' for unnamable entities; the main text should provide a precise operational definition and measurement protocol for 'hallucination' in these tasks.
[Figures 2-4] Figure captions and axis labels in the performance comparison plots should explicitly state the number of stimuli per condition and any statistical significance tests used.
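
As one illustration of the kind of significance test the second minor comment asks for, the sketch below runs an exact two-sided permutation test on the difference in mean accuracy between two conditions. This is a generic technique chosen here for illustration; the source does not say which test, if any, the authors used, and the data are invented.

```python
from itertools import combinations


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the difference in means.

    Enumerates every way to split the pooled outcomes into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, total = len(group_a), sum(pooled)
    n_b = len(pooled) - n_a
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    extreme = splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point ties.
        extreme += diff >= observed - 1e-12
        splits += 1
    return extreme / splits


# Hypothetical per-item correctness (1 = correct), invented for illustration.
nameable = [1, 1, 1, 1]
unnameable = [0, 0, 0, 0]
print(exact_permutation_pvalue(nameable, unnameable))  # 2/70, about 0.0286
```

For realistic stimulus counts, a Monte Carlo version that samples random splits would replace the full enumeration.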

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the evidential requirements for our claims. We respond to each major comment below and have incorporated revisions to address the concerns raised.


Referee: The experimental sections (semantic correspondence, synthetic shape matching, and face matching) do not report explicit matching or statistical controls for visual complexity, feature statistics, training-data frequency, or alignment with vision-encoder priors between nameable and unnamable stimuli. Without such controls, the observed performance gaps cannot be unambiguously attributed to the presence or absence of semantic labels rather than these confounds.

Authors: We agree that the original experiments did not include explicit statistical matching or regression controls for all listed confounds. While the synthetic shape stimuli were procedurally generated with matched parameters for complexity, we did not quantify or control for training-data frequency or vision-encoder alignment. In the revision we will add a dedicated analysis subsection reporting edge-density and symmetry metrics, corpus-frequency estimates, and linear-probe alignment scores between conditions, together with regression models that partial out these variables when testing the nameability effect. These additions will appear in Section 4.
Revision: yes
Referee: The arbitrary-name teaching result and the finetuning result are presented as evidence that the limitation is addressable, yet the manuscript does not include ablation controls demonstrating that performance gains arise specifically from label acquisition rather than incidental changes in visual feature processing or data distribution during the intervention.

Authors: We acknowledge the absence of targeted ablations. The arbitrary-name intervention introduces only new lexical entries without changing visual statistics, yet we did not test a shuffled-label control. For finetuning we compared against language-prior baselines but did not freeze the language component. The revised Section 5 will include (i) an ablation that assigns arbitrary names but randomly permutes them and (ii) a comparison of task-specific fine-tuning with the language model frozen versus updated. These controls will isolate the contribution of semantic-label acquisition.
Revision: yes
Referee: Logit Lens analysis is used to confirm explicit recovery of semantic labels for nameable entities, but the manuscript does not quantify or compare the degree of visual-feature utilization (e.g., via attention maps or representation similarity) between nameable and unnamable conditions to rule out differential visual processing as an alternative explanation.

Authors: The Logit Lens results show early-layer recovery of semantic tokens only for nameable entities. To rule out differential upstream visual processing we will augment Section 4.3 with (i) attention-weight comparisons over visual patches and (ii) centered kernel alignment (CKA) between vision-encoder outputs and final-layer representations across the two conditions. These metrics will be reported alongside the existing Logit Lens plots.
Revision: yes
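
The last response proposes centered kernel alignment (CKA) as a representation-similarity metric. The rebuttal does not specify a kernel, so the sketch below assumes the linear variant, CKA(X, Y) = ||XᵀY||²_F / (||XᵀX||_F · ||YᵀY||_F) on column-centered feature matrices. The helper names and toy matrices are hypothetical, and the code is pure Python for readability rather than speed.

```python
import math


def _center_columns(m):
    """Subtract each column's mean from a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of A^T B for row-major matrices A and B."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples."""
    xc, yc = _center_columns(x), _center_columns(y)
    num = _cross_frobenius_sq(xc, yc)
    den = math.sqrt(_cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc))
    return num / den


# Toy representations: four examples, invented for illustration.
x = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
scaled = [[2.0 * v for v in row] for row in x]       # same geometry, rescaled
u = [[1.0], [-1.0], [1.0], [-1.0]]
v = [[1.0], [1.0], [-1.0], [-1.0]]                   # uncorrelated with u

print(linear_cka(x, scaled))  # 1.0
print(linear_cka(u, v))       # 0.0
```

Linear CKA is invariant to isotropic rescaling, which is why the rescaled copy scores 1.0; the function assumes neither input is constant across examples, since that would make the denominator zero.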

Circularity Check

0 steps flagged

No circularity: empirical comparisons and interventions stand independently


The paper advances its central claim through direct experimental contrasts (nameable vs. unnamable entities on semantic correspondence, synthetic shape matching, and face matching tasks), Logit Lens mechanistic probes, an arbitrary-name teaching intervention, and task-specific finetuning. No equations, fitted parameters, or derivations appear that reduce to their own inputs by construction. Prior-work citations are used only to motivate the problem statement and are not invoked as uniqueness theorems or load-bearing justifications for the present results. The analysis therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that required visual information is already encoded in internal representations and on the paper-specific hypothesis that nameability is the decisive factor separating successful from failed reasoning.

axioms (2)

domain assumption: VLMs encode the required fine-grained visual information in their internal representations even when they fail to use it.
Explicitly stated in the abstract as the premise for attributing failure to label absence rather than missing information.

ad hoc to paper: Performance gaps are driven by semantic label availability rather than other task or model factors.
This is the load-bearing hypothesis tested via nameable vs. unnamable conditions.
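
The first axiom is the kind of claim usually checked with a probe: a simple classifier trained on frozen internal representations. As a rough sketch under stated assumptions, the code below uses a nearest-centroid classifier as a stand-in probe. The paper's Representation Probing method is not described in this excerpt, so the probe type, the function names, and the toy features are all hypothetical.

```python
def fit_centroids(features, labels):
    """Compute the mean feature vector for each label."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit on the training split, report accuracy on the held-out split."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Toy "hidden states" for two shape classes, invented for illustration.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["A", "A", "B", "B"]
test_x = [[0.2, 0.4], [5.5, 4.8]]
test_y = ["A", "B"]
print(probe_accuracy(train_x, train_y, test_x, test_y))  # 1.0
```

High probe accuracy alongside low end-to-end task accuracy is the pattern the axiom describes: the information is present in the representation but the model does not use it.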

pith-pipeline@v0.9.0 · 5571 in / 1291 out tokens · 43504 ms · 2026-05-13T21:53:16.125389+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS
cs.RO 2026-04 unverdicted novelty 7.0

3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
cs.CL 2026-04 unverdicted novelty 6.0

Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
