Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

Jiawei Zhou; Luzhe Sun; Matthew R. Walter; Zixuan Lan

arXiv: 2605.22903 · v1 · pith:B7A6DTJJnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.CL


Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou

Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL

keywords: vision-language models, visual grounding, benchmark evaluation, hallucination, image tokens, multimodal models, fine-grained understanding


The pith

Vision-language models maintain high benchmark scores even after most image tokens are removed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper begins with the finding that deleting a large share of image tokens causes only small accuracy drops on a common hallucination benchmark. The authors then run controlled tests on open-source VLMs that include global image degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level checks. They also examine how visual token representations change across layers. The combined results show that models incorporate visual input yet remain less sensitive to the loss of fine-grained visual evidence than accuracy scores alone would indicate.

Core claim

Although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. Layer-wise analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for the behavioral findings.
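One way to make "weakened internal support with an unchanged prediction" concrete is a decision margin: the logit of the correct answer minus the best competing logit. The sketch below is illustrative only. The function name `decision_margin` and the logit values are assumptions made up for the example, not the paper's metric or data.

```python
import numpy as np

def decision_margin(logits, correct_idx):
    """Logit of the correct answer minus the largest competing logit."""
    logits = np.asarray(logits, dtype=float)
    competitors = np.delete(logits, correct_idx)
    return float(logits[correct_idx] - competitors.max())

# Hypothetical yes/no logits for one question; index 0 is the correct answer.
full_image = [4.0, 1.0]      # all image tokens kept (made-up values)
degraded_image = [2.0, 1.5]  # most image tokens removed (made-up values)

m_full = decision_margin(full_image, 0)
m_degraded = decision_margin(degraded_image, 0)

# The argmax is unchanged, so accuracy would not register any difference,
# even though the margin has shrunk.
same_prediction = bool(np.argmax(full_image) == np.argmax(degraded_image))
```

In this toy case the predicted answer is the same under both conditions while the margin is smaller for the degraded input, which is the kind of change an accuracy score cannot show.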

What carries the argument

Systematic removal and degradation of image tokens combined with layer-wise analysis of vision-token geometry.
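A minimal sketch of random image-token removal, assuming the image tokens are available as an array of shape (number of tokens, hidden size). The helper `drop_image_tokens`, the token count of 576, and the hidden size of 64 are hypothetical choices for illustration; the paper's actual removal procedure may differ.

```python
import numpy as np

def drop_image_tokens(tokens, drop_ratio, rng):
    """Randomly remove a fraction of image tokens, preserving token order."""
    n = tokens.shape[0]
    # Always keep at least one token so the model still receives some input.
    n_keep = max(1, int(round(n * (1.0 - drop_ratio))))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))  # hypothetical: 576 tokens, 64 dimensions
kept, idx = drop_image_tokens(tokens, drop_ratio=0.75, rng=rng)
```

Sweeping `drop_ratio` over several values and recording accuracy at each one would give an accuracy-versus-removal curve of the kind the paper describes.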

If this is right

Benchmark accuracy overestimates how much VLMs depend on detailed visual evidence.
Models can succeed via language priors or coarse visual features even when fine details are missing.
Internal evidence for the correct answer can weaken before the output changes.
Deeper network layers show greater similarity among visual tokens, limiting fine-grained distinctions.
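The last point can be illustrated with a simple similarity statistic: the mean cosine similarity over all distinct pairs of visual tokens in a layer. This is a sketch under assumptions. The paper's exact geometry measures are not specified here, and the "layers" below are synthetic arrays built to mimic tokens collapsing toward a shared direction, not real hidden states.

```python
import numpy as np

def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = sim.shape[0]
    # Exclude the diagonal (each token's similarity with itself).
    off_diag_sum = sim.sum() - np.trace(sim)
    return float(off_diag_sum / (n * (n - 1)))

# Synthetic stand-in for per-layer visual-token hidden states.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1, 32))   # a direction common to all tokens
noise = rng.normal(size=(100, 32))  # token-specific variation
# A larger weight on the shared direction makes tokens more alike.
layers = [noise + w * shared for w in (0.0, 1.0, 4.0)]
sims = [mean_pairwise_cosine(h) for h in layers]
```

With real models, the same function could be applied to the visual-token slice of each layer's hidden states; rising values with depth would correspond to the pattern described above.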

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New evaluation protocols could add systematic token-removal or occlusion tests to measure true visual grounding.
The same pattern may help explain why VLMs produce hallucinations when visual support is actually weak.
The approach could be extended to other multimodal tasks to detect shortcut learning.
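As a sketch of what a localized occlusion test could look like at the pixel level, the snippet below blanks out a rectangular region of an image array. The function `occlude_box`, the box coordinates, and the placeholder image are hypothetical; in practice the box would come from an object annotation or detector, which is not shown.

```python
import numpy as np

def occlude_box(image, box, fill=0):
    """Return a copy of `image` with the (top, left, bottom, right) box filled."""
    top, left, bottom, right = box
    out = image.copy()  # leave the original image untouched
    out[top:bottom, left:right] = fill
    return out

image = np.full((224, 224, 3), 255, dtype=np.uint8)  # placeholder white image
occluded = occlude_box(image, box=(50, 60, 150, 160))

# Fraction of pixels that were blanked out (all channels equal to the fill).
masked_fraction = float((occluded == 0).all(axis=-1).mean())
```

Comparing model answers on `image` and `occluded` for questions about the masked object would be one way to probe whether the answer depends on that region.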

Load-bearing premise

Removing image tokens isolates reliance on fine-grained visual evidence rather than language priors or coarse features.

What would settle it

A benchmark where performance drops sharply and in proportion to the amount of fine-grained visual detail removed, matching the expected dependence on that detail.

Figures

Figures reproduced from arXiv: 2605.22903 by Jiawei Zhou, Luzhe Sun, Matthew R. Walter, Zixuan Lan.

**Figure 2.** Effect of random image token dropping on POPE ac...

**Figure 3.** Distribution of decision margins under different visual...

**Figure 4.** Layer-wise representational analysis of visual tokens in the vision encoder. We evaluate spatial discriminability from three...

Original abstract

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, covering global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. The representation-level analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Token removal on a hallucination benchmark causes little accuracy drop in open VLMs, but the design leaves open whether coarse cues or language priors explain the stability, so it does not yet prove that the benchmarks are weak tests of fine-grained grounding.


The core observation is that stripping out a large share of image tokens barely hurts performance on a standard hallucination benchmark. The paper backs this with occlusion tests, question rewrites, answer-space changes, and layer-wise token similarity checks across open VLMs. That combination of behavioral and internal analyses is the main new piece: prior work has looked at hallucinations, but not at this systematic probe of visual token reliance at multiple granularities, nor at the geometry finding in deeper layers. The work is honest about what the numbers show and sticks to empirical patterns without overclaiming mechanisms upfront. It deserves credit for running the tests on accessible models and for reporting that internal support for answers can weaken even when the final output stays the same.

The soft spot sits in the causal link. Token removal and the other manipulations need to cleanly separate fine-grained visual evidence from coarse global features or language priors, yet the abstract gives no detail on how much visual signal actually remains after removal or on controls for answer biases. If residual coarse information or statistical shortcuts still support the predictions, the mismatch between accuracy and visual sensitivity does not yet prove the benchmarks are insufficient for fine-grained grounding. The layer similarity result is interesting but sits at one remove from the behavioral claim.

This is the kind of paper that belongs in a reading group focused on evaluation practices; anyone building or auditing VLM benchmarks would get concrete angles to test. It is worth sending to referees because the question matters for the field and the experiments are replicable in principle, though the write-up will need tighter justification of the isolation step and fuller statistical reporting before the central claim lands cleanly.

Referee Report

2 major / 2 minor

Summary. The paper claims that vision-language model (VLM) benchmarks do not reliably evaluate fine-grained visual grounding, because removing substantial fractions of image tokens (and related manipulations such as occlusion and question reformulation) produces only minor accuracy drops on standard hallucination benchmarks, while layer-wise analyses reveal increasing similarity among vision tokens in deeper layers; the authors conclude that model predictions are less sensitive to fine-grained visual evidence than benchmark scores imply.

Significance. If the central empirical claim is substantiated, the work would be significant for VLM evaluation research by identifying a systematic mismatch between accuracy and visual sensitivity and by supplying both behavioral and representation-level evidence. The multi-granularity design (global degradation, localized occlusion, answer-space expansion, and decision-level metrics) plus the geometry analysis constitute a strength that could usefully inform future benchmark construction.

major comments (2)

[token-removal experiments (abstract and §4)] The interpretation that stable performance after image-token removal demonstrates insufficient fine-grained visual grounding rests on the unverified assumption that the retained tokens (and internal representations) contain no residual fine-grained cues capable of supporting the observed predictions. Without explicit controls quantifying what visual information survives the removal operation, the behavioral results cannot be unambiguously attributed to language priors or coarse features rather than incomplete isolation of fine-grained signals.
[layer-wise analysis (§5)] The layer-wise vision-token similarity analysis is presented as a possible mechanistic explanation, yet no quantitative mapping is provided between the reported increase in token similarity across layers and the magnitude (or absence) of accuracy change under each behavioral manipulation; this weakens the link between the representation-level findings and the central claim about benchmark sufficiency.

minor comments (2)

The abstract refers to 'decision-level analyses beyond standard accuracy' without naming the concrete metrics (e.g., logit margins, calibration, or answer-probability ratios) used; these should be defined in the methods section for reproducibility.
Statistical reporting (error bars, number of runs, significance tests) for the reported accuracy deltas is not mentioned in the provided description and should be added to all behavioral result tables or figures.
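To illustrate the kind of statistical reporting the second minor comment asks for, the sketch below computes a paired bootstrap confidence interval for an accuracy difference between two conditions evaluated on the same questions. It is one common option, not a procedure taken from the paper. The function `paired_bootstrap_delta` and the per-question correctness arrays are synthetic and purely illustrative.

```python
import numpy as np

def paired_bootstrap_delta(correct_a, correct_b, n_boot=2000, seed=0):
    """Percentile 95% CI for accuracy(a) - accuracy(b) over the same questions."""
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    # Resample question indices with replacement; the same indices are used
    # for both conditions so the pairing is preserved.
    idx = rng.integers(0, n, size=(n_boot, n))
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(a.mean() - b.mean()), float(lo), float(hi)

# Synthetic per-question correctness flags (1 = correct), not real results.
rng = np.random.default_rng(1)
full = (rng.random(500) < 0.85).astype(int)     # e.g. all image tokens kept
dropped = (rng.random(500) < 0.80).astype(int)  # e.g. most tokens removed
delta, lo, hi = paired_bootstrap_delta(full, dropped)
```

An interval that includes zero would indicate that the observed accuracy drop cannot be distinguished from resampling noise at this sample size.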

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for noting the potential significance of the multi-granularity design. We respond to each major comment below.


Referee: [token-removal experiments (abstract and §4)] The interpretation that stable performance after image-token removal demonstrates insufficient fine-grained visual grounding rests on the unverified assumption that the retained tokens (and internal representations) contain no residual fine-grained cues capable of supporting the observed predictions. Without explicit controls quantifying what visual information survives the removal operation, the behavioral results cannot be unambiguously attributed to language priors or coarse features rather than incomplete isolation of fine-grained signals.

Authors: We agree that the token-removal results would be strengthened by explicit quantification of residual visual information. Our design already incorporates complementary probes (localized occlusion, question reformulation, answer-space expansion, and decision-level metrics) that target fine-grained cues more directly than global removal alone. Nevertheless, we will add a control analysis that measures retained information via performance on auxiliary fine-grained visual tasks using only the surviving tokens. Revision: partial.

Referee: [layer-wise analysis (§5)] The layer-wise vision-token similarity analysis is presented as a possible mechanistic explanation, yet no quantitative mapping is provided between the reported increase in token similarity across layers and the magnitude (or absence) of accuracy change under each behavioral manipulation; this weakens the link between the representation-level findings and the central claim about benchmark sufficiency.

Authors: The layer-wise similarity analysis is offered as a possible mechanistic account rather than a direct causal mapping. We acknowledge that an explicit quantitative link to the magnitude of behavioral changes would tighten the connection. In revision we will add a brief correlation analysis between per-layer similarity statistics and the accuracy (and decision-level) changes observed across the behavioral manipulations, together with clearer language that the geometry results are complementary rather than definitive. Revision: partial.

Circularity Check

0 steps flagged

No significant circularity; empirical observations are independent


The paper reports direct experimental results from token removal, occlusion, question reformulation, and layer-wise token similarity measurements on VLMs. These are behavioral and representational observations that stand on their own without reducing to fitted parameters, self-definitions, or self-citation chains by construction. The central claim about benchmark sufficiency follows from the reported performance mismatches and internal analyses rather than any input being renamed or presupposed as output. No equations, ansatzes, or uniqueness theorems are invoked that collapse the argument onto itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about what benchmarks and token manipulations measure; no free parameters or invented entities are introduced in the abstract.

axioms (2)

Domain assumption: Benchmark accuracy is assumed to reflect grounded visual understanding. This is the motivating premise the paper tests and ultimately questions.
Domain assumption: Token removal and related manipulations isolate fine-grained visual reliance. This is central to interpreting slight degradation as evidence of insufficient visual grounding.

pith-pipeline@v0.9.0 · 5741 in / 1346 out tokens · 44341 ms · 2026-05-25T05:55:39.831563+00:00 · methodology


[1] [1]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

work page 2025

[2] [2]

Halc: Object hallucination reduc- tion via adaptive focal-contrast decoding

Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. Halc: Object hallucination reduc- tion via adaptive focal-contrast decoding. InForty-first In- ternational Conference on Machine Learning, 2024. 1, 2

work page 2024

[3] [3]

Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison- Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Y...

work page 2024

[4] [4]

On statistical efficiency in learning.IEEE Transactions on In- formation Theory, 67(4):2488–2506, 2020

Jie Ding, Enmao Diao, Jiawei Zhou, and Vahid Tarokh. On statistical efficiency in learning.IEEE Transactions on In- formation Theory, 67(4):2488–2506, 2020. 2

work page 2020

[5] [5]

Enhancing vision-language model relia- bility with uncertainty-guided dropout decoding.Advances in Neural Information Processing Systems, 38:149193– 149218, 2025

Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, and Jiawei Zhou. Enhancing vision-language model relia- bility with uncertainty-guided dropout decoding.Advances in Neural Information Processing Systems, 38:149193– 149218, 2025. 1, 2

work page 2025

[6] [6]

Mme: A comprehensive evaluation benchmark for multimodal large language models, 2025

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2025. 1, 2, 3

work page 2025

[7] [7]

Does ob- ject grounding really reduce hallucination of large vision- language models?, 2024

Gregor Geigle, Radu Timofte, and Goran Glavaš. Does ob- ject grounding really reduce hallucination of large vision- language models?, 2024. 2

work page 2024

[8] [8]

Hal- lusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision- language models, 2024

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hal- lusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision- language models, 2024. 2

work page 2024

[9] [9]

Do vision-language models really understand visual lan- guage?, 2025

Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual lan- guage?, 2025. 2

work page 2025

[10] [10]

A survey on evaluation of multimodal large language models, 2024

Jiaxing Huang and Jingyi Zhang. A survey on evaluation of multimodal large language models, 2024. 1

work page 2024

[11] [11]

Robustifying vision-language models via dynamic token reweighting.arXiv preprint arXiv:2505.17132, 2025

Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma, and Ting Wang. Robustifying vision-language models via dynamic token reweighting.arXiv preprint arXiv:2505.17132, 2025. 2

work page arXiv 2025

[12] [12]

A comprehensive analysis for visual object hallucination in large vision-language mod- els, 2025

Liqiang Jing, Guiming Hardy Chen, Ehsan Aghazadeh, Xin Eric Wang, and Xinya Du. A comprehensive analysis for visual object hallucination in large vision-language mod- els, 2025. 2

work page 2025

[13] [13]

Do you see me : A mul- tidimensional benchmark for evaluating visual perception in multimodal llms, 2025

Aditya Kanade and Tanuja Ganu. Do you see me : A mul- tidimensional benchmark for evaluating visual perception in multimodal llms, 2025. 2

work page 2025

[14] [14]

Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, and Stefano Soatto. Throne: An object-based hallucination benchmark for the free-form generations of large vision-language models,

work page

[15] [15]

Halp: Detecting hallucinations in vision- language models without generating a single token

Sai Akhil Kogilathota, Sripadha Vallabha EG, Luzhe Sun, and Jiawei Zhou. Halp: Detecting hallucinations in vision- language models without generating a single token. InPro- ceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6067–6085, 2026. 1, 2

work page 2026

[16] Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, and Kyomin Jung. VLind-bench: Measuring language priors in large vision-language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4129–4144, Albuquerque, New Mexico, 2025. Association for Computational Linguistics.

[17] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023.

[18] Yanhong Li, Zixuan Lan, and Jiawei Zhou. Text or pixels? Evaluating efficiency and understanding of LLMs with visual text inputs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10564–10578, Suzhou, China, 2025. Association for Computational Linguistics.

[19] Yanhong Li, Ming Li, Karen Livescu, and Jiawei Zhou. On the predictive power of representation dispersion in language models. In The Fourteenth International Conference on Learning Representations, 2026.

[20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.

[21] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2024.

[22] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024.

[23] Nhi Pham and Michael Schott. H-pope: Hierarchical polling-based probing evaluation of hallucinations in large vision-language models, 2024.

[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.

[25] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos,

[26] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning, 2019.

[27] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pages 606–610, 2007.

[28] Ananya Sadana, Yash Kumar Lal, and Jiawei Zhou. Iso-bench: Benchmarking multimodal causal reasoning in visual-language models through procedural plans. arXiv preprint arXiv:2507.23135, 2025.

[29] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge, 2022.

[30] Hala Sheta, Eric Haoran Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, et al. From behavioral performance to internal competence: Interpreting vision-language models with vlm-lens. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, ...

[31] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, ... OpenAI GPT-5 system card, 2025.

[32] Mingyang Song, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. From head to tail: Towards balanced representation in large vision-language models through adaptive data calibration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9434–9444, 2025.

[33] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul ... 2025.

[34] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas... 2025.

[35] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal llms, 2024.

[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

[37] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation,

[38] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingchen... 2025.