
arxiv: 2605.22903 · v1 · pith:B7A6DTJJnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.CL

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords vision-language models · visual grounding · benchmark evaluation · hallucination · image tokens · multimodal models · fine-grained understanding

The pith

Vision-language models maintain high benchmark scores even after most image tokens are removed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper begins with the finding that deleting a large share of image tokens causes only small accuracy drops on a common hallucination benchmark. The authors then run controlled tests on open-source VLMs that include global image degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level checks. They also examine how visual token representations change across layers. The combined results show that models incorporate visual input yet remain less sensitive to the loss of fine-grained visual evidence than accuracy scores alone would indicate.

Core claim

Although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. Layer-wise analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for the behavioral findings.
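One way to picture the layer-wise observation is to compute the average cosine similarity between all pairs of visual-token vectors at a given layer. The sketch below is illustrative only: the function name `mean_pairwise_cosine` and the toy 2-D vectors are assumptions made here, not the paper's metric or data, and the code assumes every vector is non-zero.

```python
import math

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of token vectors."""
    # Normalize each vector to unit length (assumes no zero vectors).
    normed = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        normed.append([x / norm for x in v])
    total, count = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            total += sum(a * b for a, b in zip(normed[i], normed[j]))
            count += 1
    return total / count

# Toy stand-ins for token representations at a shallow and a deep layer.
shallow_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
deep_layer = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]

shallow_sim = mean_pairwise_cosine(shallow_layer)
deep_sim = mean_pairwise_cosine(deep_layer)
print(shallow_sim, deep_sim)
```

In this toy setup the deep-layer tokens point in nearly the same direction, so their mean similarity is higher than that of the shallow-layer tokens. A rising value across layers would correspond to the pattern the review describes.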

What carries the argument

Systematic removal and degradation of image tokens combined with layer-wise analysis of vision-token geometry.
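As a rough illustration of the token-removal intervention, the sketch below randomly drops a fraction of image tokens while keeping the survivors in their original order. The helper `drop_image_tokens`, the 576-token grid, and the 75% drop ratio are assumptions chosen for illustration; the paper's actual procedure and settings may differ.

```python
import random

def drop_image_tokens(tokens, drop_ratio, seed=0):
    """Randomly remove a fraction of image tokens, preserving the order of the rest."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be in [0, 1]")
    n_keep = len(tokens) - int(round(drop_ratio * len(tokens)))
    rng = random.Random(seed)
    # Sample indices to keep, then sort so spatial order is preserved.
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx

# Hypothetical 24x24 patch grid; each "token" is represented by its grid index.
tokens = list(range(576))
kept, kept_idx = drop_image_tokens(tokens, drop_ratio=0.75)
print(len(kept))  # 144
```

Fixing the seed makes the dropped subset reproducible, which matters when comparing accuracy across drop ratios on the same examples.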

If this is right

  • Benchmark accuracy overestimates how much VLMs depend on detailed visual evidence.
  • Models can succeed via language priors or coarse visual features even when fine details are missing.
  • Internal evidence for the correct answer can weaken before the output changes.
  • Deeper network layers show greater similarity among visual tokens, limiting fine-grained distinctions.
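The third point, that internal support can weaken before the output flips, can be made concrete with a decision margin. The sketch below defines the margin as the logit of the correct answer minus the best competing logit. This definition and the logit values are assumptions for illustration; the review does not state which decision-level metric the paper uses.

```python
def predict(logits):
    """Return the answer with the highest logit."""
    return max(logits, key=logits.get)

def decision_margin(logits, correct):
    """Logit of the correct answer minus the best competing logit."""
    best_other = max(v for k, v in logits.items() if k != correct)
    return logits[correct] - best_other

# Made-up logits for a yes/no question whose correct answer is "yes".
full_image = {"yes": 4.0, "no": 1.0}
degraded_image = {"yes": 2.2, "no": 1.8}

for name, logits in [("full", full_image), ("degraded", degraded_image)]:
    print(name, predict(logits), round(decision_margin(logits, "yes"), 2))
```

Both inputs yield the prediction "yes", so accuracy is unchanged, yet the margin under the degraded image is much smaller. Accuracy alone would hide this shift.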

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New evaluation protocols could add systematic token-removal or occlusion tests to measure true visual grounding.
  • The same pattern may help explain why VLMs produce hallucinations when visual support is actually weak.
  • The approach could be extended to other multimodal tasks to detect shortcut learning.

Load-bearing premise

Removing image tokens isolates reliance on fine-grained visual evidence rather than language priors or coarse features.

What would settle it

A benchmark where performance drops sharply and in proportion to the amount of fine-grained visual detail removed, matching the expected dependence on that detail.
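One simple way to quantify "in proportion" is the least-squares slope of accuracy against the fraction of visual detail removed. The sketch below is an assumption about how such a check could be set up, not a procedure from the paper, and both accuracy curves are made-up numbers.

```python
def accuracy_slope(removed_fractions, accuracies):
    """Ordinary least-squares slope of accuracy against the fraction of detail removed."""
    n = len(removed_fractions)
    mean_x = sum(removed_fractions) / n
    mean_y = sum(accuracies) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(removed_fractions, accuracies))
    sxx = sum((x - mean_x) ** 2 for x in removed_fractions)
    return sxy / sxx

fractions = [0.0, 0.25, 0.5, 0.75]
# Made-up curves: one barely reacts to removal, one drops steadily with it.
insensitive = [0.87, 0.86, 0.86, 0.85]
proportional = [0.87, 0.78, 0.69, 0.60]

slope_insensitive = accuracy_slope(fractions, insensitive)
slope_proportional = accuracy_slope(fractions, proportional)
print(slope_insensitive, slope_proportional)
```

A slope near zero matches the insensitivity the paper reports, whereas a clearly negative slope is what a benchmark that depends on fine-grained detail would be expected to show.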

Figures

Figures reproduced from arXiv: 2605.22903 by Jiawei Zhou, Luzhe Sun, Matthew R. Walter, Zixuan Lan.

Figure 1: We intervene on the input image at both global and en…
Figure 2: Effect of random image token dropping on POPE ac…
Figure 3: Distribution of decision margins under different visual…
Figure 4: Layer-wise representational analysis of visual tokens in the vision encoder. We evaluate spatial discriminability from three…
Original abstract

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, including global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. The representation-level analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that vision-language model (VLM) benchmarks do not reliably evaluate fine-grained visual grounding, because removing substantial fractions of image tokens (and related manipulations such as occlusion and question reformulation) produces only minor accuracy drops on standard hallucination benchmarks, while layer-wise analyses reveal increasing similarity among vision tokens in deeper layers; the authors conclude that model predictions are less sensitive to fine-grained visual evidence than benchmark scores imply.

Significance. If the central empirical claim is substantiated, the work would be significant for VLM evaluation research by identifying a systematic mismatch between accuracy and visual sensitivity and by supplying both behavioral and representation-level evidence. The multi-granularity design (global degradation, localized occlusion, answer-space expansion, and decision-level metrics) plus the geometry analysis constitute a strength that could usefully inform future benchmark construction.

major comments (2)
  1. [token-removal experiments (abstract and §4)] The interpretation that stable performance after image-token removal demonstrates insufficient fine-grained visual grounding rests on the unverified assumption that the retained tokens (and internal representations) contain no residual fine-grained cues capable of supporting the observed predictions. Without explicit controls quantifying what visual information survives the removal operation, the behavioral results cannot be unambiguously attributed to language priors or coarse features rather than incomplete isolation of fine-grained signals.
  2. [layer-wise analysis (§5)] The layer-wise vision-token similarity analysis is presented as a possible mechanistic explanation, yet no quantitative mapping is provided between the reported increase in token similarity across layers and the magnitude (or absence) of accuracy change under each behavioral manipulation; this weakens the link between the representation-level findings and the central claim about benchmark sufficiency.
minor comments (2)
  1. The abstract refers to 'decision-level analyses beyond standard accuracy' without naming the concrete metrics (e.g., logit margins, calibration, or answer-probability ratios) used; these should be defined in the methods section for reproducibility.
  2. Statistical reporting (error bars, number of runs, significance tests) for the reported accuracy deltas is not mentioned in the provided description and should be added to all behavioral result tables or figures.
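The statistical reporting requested in the second minor comment could take the form of a paired bootstrap confidence interval on the accuracy difference between two conditions. The sketch below is one possible approach, not the authors' method; the function name `bootstrap_accuracy_delta`, the resample count, and the per-example correctness vectors are all hypothetical.

```python
import random

def bootstrap_accuracy_delta(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired percentile-bootstrap CI for accuracy(a) - accuracy(b) on the same examples."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same examples")
    n = len(correct_a)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        # Resample example indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, lo, hi

# Hypothetical per-example correctness (1 = correct) for full and token-dropped inputs.
full_input = [1] * 90 + [0] * 10
dropped_input = [1] * 85 + [0] * 15

point, lo, hi = bootstrap_accuracy_delta(full_input, dropped_input)
print(point, lo, hi)
```

Resampling the same indices for both conditions keeps the comparison paired, which usually gives a tighter interval than bootstrapping each condition independently.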

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for noting the potential significance of the multi-granularity design. We respond to each major comment below.

Point-by-point responses
  1. Referee: [token-removal experiments (abstract and §4)] The interpretation that stable performance after image-token removal demonstrates insufficient fine-grained visual grounding rests on the unverified assumption that the retained tokens (and internal representations) contain no residual fine-grained cues capable of supporting the observed predictions. Without explicit controls quantifying what visual information survives the removal operation, the behavioral results cannot be unambiguously attributed to language priors or coarse features rather than incomplete isolation of fine-grained signals.

    Authors: We agree that the token-removal results would be strengthened by explicit quantification of residual visual information. Our design already incorporates complementary probes (localized occlusion, question reformulation, answer-space expansion, and decision-level metrics) that target fine-grained cues more directly than global removal alone. Nevertheless, we will add a control analysis that measures retained information via performance on auxiliary fine-grained visual tasks using only the surviving tokens. revision: partial

  2. Referee: [layer-wise analysis (§5)] The layer-wise vision-token similarity analysis is presented as a possible mechanistic explanation, yet no quantitative mapping is provided between the reported increase in token similarity across layers and the magnitude (or absence) of accuracy change under each behavioral manipulation; this weakens the link between the representation-level findings and the central claim about benchmark sufficiency.

    Authors: The layer-wise similarity analysis is offered as a possible mechanistic account rather than a direct causal mapping. We acknowledge that an explicit quantitative link to the magnitude of behavioral changes would tighten the connection. In revision we will add a brief correlation analysis between per-layer similarity statistics and the accuracy (and decision-level) changes observed across the behavioral manipulations, together with clearer language that the geometry results are complementary rather than definitive. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical observations are independent

full rationale

The paper reports direct experimental results from token removal, occlusion, question reformulation, and layer-wise token similarity measurements on VLMs. These are behavioral and representational observations that stand on their own without reducing to fitted parameters, self-definitions, or self-citation chains by construction. The central claim about benchmark sufficiency follows from the reported performance mismatches and internal analyses rather than any input being renamed or presupposed as output. No equations, ansatzes, or uniqueness theorems are invoked that collapse the argument onto itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about what benchmarks and token manipulations measure; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Benchmark accuracy is assumed to reflect grounded visual understanding.
    This is the motivating premise the paper tests and ultimately questions.
  • domain assumption: Token removal and related manipulations isolate fine-grained visual reliance.
    Central to interpreting slight degradation as evidence of insufficient visual grounding.

pith-pipeline@v0.9.0 · 5741 in / 1346 out tokens · 44341 ms · 2026-05-25T05:55:39.831563+00:00 · methodology

