Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3
The pith
Vision-language models maintain high benchmark scores even after most image tokens are removed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. Layer-wise analysis shows increasing similarity among visual tokens in deeper layers, which offers a possible explanation for the behavioral findings.
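One way to make "weakened internal support with an unchanged prediction" concrete is a probability margin between the correct answer and its strongest competitor. The sketch below is illustrative only: the paper's actual decision-level metrics are not named in this summary, and the function names and the yes/no logit values are hypothetical placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_margin(logits, correct_index):
    """Probability of the correct answer minus the best competing answer.

    A positive margin means the correct answer is still predicted;
    a margin that shrinks toward zero means the support for it is eroding.
    """
    probs = softmax(logits)
    best_other = max(p for i, p in enumerate(probs) if i != correct_index)
    return probs[correct_index] - best_other


# Hypothetical logits over ("yes", "no"); index 0 is the correct answer.
full_image = [3.0, 0.5]     # placeholder values, not measured data
pruned_image = [1.2, 0.9]   # placeholder values, not measured data

for name, logits in [("full", full_image), ("pruned", pruned_image)]:
    print(name, round(answer_margin(logits, correct_index=0), 3))
```

In both placeholder cases the argmax is the same, so accuracy would not change, while the margin drops. That gap is what an accuracy-only evaluation cannot see.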
What carries the argument
Systematic removal and degradation of image tokens combined with layer-wise analysis of vision-token geometry.
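As a rough illustration of the token-removal manipulation, the sketch below keeps a random fraction of image tokens while preserving their order. It assumes uniform random selection and a plain Python list of tokens; the paper's actual selection strategy and tensor layout are not described in this summary, and `drop_image_tokens` is a hypothetical helper name.

```python
import random


def drop_image_tokens(tokens, keep_ratio, seed=0):
    """Keep a random subset of image tokens, preserving their original order.

    tokens:     sequence of image tokens (any objects, e.g. embedding vectors)
    keep_ratio: fraction of tokens to retain, between 0 and 1
    seed:       fixes the random subset so runs are repeatable
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    n_keep = round(len(tokens) * keep_ratio)
    rng = random.Random(seed)
    kept_positions = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_positions]


# Example: a hypothetical grid of 576 image tokens, represented by their indices.
image_tokens = list(range(576))
survivors = drop_image_tokens(image_tokens, keep_ratio=0.25, seed=1)
print(len(survivors))
```

Sweeping `keep_ratio` over several values and re-running the benchmark at each setting would give an accuracy-versus-retention curve of the kind the summary describes.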
If this is right
- Benchmark accuracy overestimates how much VLMs depend on detailed visual evidence.
- Models can succeed via language priors or coarse visual features even when fine details are missing.
- Internal evidence for the correct answer can weaken before the output changes.
- Deeper network layers show greater similarity among visual tokens, limiting fine-grained distinctions.
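The last point, rising similarity among visual tokens in deeper layers, can be measured with a simple statistic such as mean pairwise cosine similarity. The sketch below is one plausible choice, not necessarily the paper's geometry measure, and the two toy "layers" are hand-made placeholder vectors rather than real hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all unordered pairs of token vectors."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(tokens[i], tokens[j])
    return total / (n * (n - 1) / 2)


# Placeholder token vectors: spread out in a "shallow" layer,
# nearly parallel in a "deep" layer.
shallow = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
deep = [[1.0, 0.10], [1.0, 0.20], [1.0, 0.15]]

print(mean_pairwise_cosine(shallow) < mean_pairwise_cosine(deep))
```

A value approaching 1 at deep layers would mean the tokens carry increasingly redundant directions, which is consistent with the claim that fine-grained distinctions become harder to recover.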
Where Pith is reading between the lines
- New evaluation protocols could add systematic token-removal or occlusion tests to measure true visual grounding.
- The same pattern may help explain why VLMs produce hallucinations when visual support is actually weak.
- The approach could be extended to other multimodal tasks to detect shortcut learning.
Load-bearing premise
Removing image tokens isolates reliance on fine-grained visual evidence rather than language priors or coarse features.
What would settle it
A benchmark where performance drops sharply and in proportion to the amount of fine-grained visual detail removed, matching the expected dependence on that detail.
Original abstract
Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, spanning global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence that standard accuracy should have suggested. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. We further complement a representation-level analysis, which shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that vision-language model (VLM) benchmarks do not reliably evaluate fine-grained visual grounding, because removing substantial fractions of image tokens (and related manipulations such as occlusion and question reformulation) produces only minor accuracy drops on standard hallucination benchmarks, while layer-wise analyses reveal increasing similarity among vision tokens in deeper layers; the authors conclude that model predictions are less sensitive to fine-grained visual evidence than benchmark scores imply.
Significance. If the central empirical claim is substantiated, the work would be significant for VLM evaluation research by identifying a systematic mismatch between accuracy and visual sensitivity and by supplying both behavioral and representation-level evidence. The multi-granularity design (global degradation, localized occlusion, answer-space expansion, and decision-level metrics) plus the geometry analysis constitute a strength that could usefully inform future benchmark construction.
major comments (2)
- [token-removal experiments (abstract and §4)] The interpretation that stable performance after image-token removal demonstrates insufficient fine-grained visual grounding rests on the unverified assumption that the retained tokens (and internal representations) contain no residual fine-grained cues capable of supporting the observed predictions. Without explicit controls quantifying what visual information survives the removal operation, the behavioral results cannot be unambiguously attributed to language priors or coarse features rather than incomplete isolation of fine-grained signals.
- [layer-wise analysis (§5)] The layer-wise vision-token similarity analysis is presented as a possible mechanistic explanation, yet no quantitative mapping is provided between the reported increase in token similarity across layers and the magnitude (or absence) of accuracy change under each behavioral manipulation; this weakens the link between the representation-level findings and the central claim about benchmark sufficiency.
minor comments (2)
- The abstract refers to 'decision-level analyses beyond standard accuracy' without naming the concrete metrics (e.g., logit margins, calibration, or answer-probability ratios) used; these should be defined in the methods section for reproducibility.
- Statistical reporting (error bars, number of runs, significance tests) for the reported accuracy deltas is not mentioned in the provided description and should be added to all behavioral result tables or figures.
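The statistical reporting requested in the second minor comment could take the form of a paired bootstrap interval on the accuracy difference between the full-image and reduced-token conditions. The sketch below assumes per-example correctness flags for the same examples under both conditions. The function name, the resample count, and the percentile method are illustrative choices, not the authors' procedure.

```python
import random


def paired_bootstrap_delta(correct_full, correct_pruned,
                           n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(full) - accuracy(pruned).

    Both inputs are per-example correctness flags (True/False or 1/0)
    for the SAME examples, in the same order.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(correct_full) != len(correct_pruned) or not correct_full:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(correct_full)
    # Per-example difference keeps the pairing between conditions.
    diffs = [int(a) - int(b) for a, b in zip(correct_full, correct_pruned)]
    point = sum(diffs) / n

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    estimates.sort()

    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

Resampling the per-example differences, rather than each condition separately, respects the fact that both conditions are evaluated on the same questions. An interval that includes zero would indicate the reported accuracy drop is not distinguishable from noise at the chosen level.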
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for noting the potential significance of the multi-granularity design. We respond to each major comment below.
Referee: [token-removal experiments (abstract and §4)] The interpretation that stable performance after image-token removal demonstrates insufficient fine-grained visual grounding rests on the unverified assumption that the retained tokens (and internal representations) contain no residual fine-grained cues capable of supporting the observed predictions. Without explicit controls quantifying what visual information survives the removal operation, the behavioral results cannot be unambiguously attributed to language priors or coarse features rather than incomplete isolation of fine-grained signals.
Authors: We agree that the token-removal results would be strengthened by explicit quantification of residual visual information. Our design already incorporates complementary probes (localized occlusion, question reformulation, answer-space expansion, and decision-level metrics) that target fine-grained cues more directly than global removal alone. Nevertheless, we will add a control analysis that measures retained information via performance on auxiliary fine-grained visual tasks using only the surviving tokens. Revision: partial.
Referee: [layer-wise analysis (§5)] The layer-wise vision-token similarity analysis is presented as a possible mechanistic explanation, yet no quantitative mapping is provided between the reported increase in token similarity across layers and the magnitude (or absence) of accuracy change under each behavioral manipulation; this weakens the link between the representation-level findings and the central claim about benchmark sufficiency.
Authors: The layer-wise similarity analysis is offered as a possible mechanistic account rather than a direct causal mapping. We acknowledge that an explicit quantitative link to the magnitude of behavioral changes would tighten the connection. In revision we will add a brief correlation analysis between per-layer similarity statistics and the accuracy (and decision-level) changes observed across the behavioral manipulations, together with clearer language stating that the geometry results are complementary rather than definitive. Revision: partial.
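The correlation analysis promised here could be as simple as a Pearson coefficient. The sketch below assumes one similarity statistic and one accuracy change per condition (for example, per model or per manipulation). How the authors would actually pair the two quantities is not stated, so that pairing is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the similarity statistics and `ys` the matching accuracy or margin changes. With only a handful of models or manipulations the coefficient will be noisy, so it would support the "complementary rather than definitive" framing rather than establish a causal link.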
Circularity Check
No significant circularity; empirical observations are independent
The paper reports direct experimental results from token removal, occlusion, question reformulation, and layer-wise token similarity measurements on VLMs. These are behavioral and representational observations that stand on their own without reducing to fitted parameters, self-definitions, or self-citation chains by construction. The central claim about benchmark sufficiency follows from the reported performance mismatches and internal analyses rather than any input being renamed or presupposed as output. No equations, ansatzes, or uniqueness theorems are invoked that collapse the argument onto itself.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Benchmark accuracy is assumed to reflect grounded visual understanding.
- Domain assumption: Token removal and related manipulations isolate fine-grained visual reliance.