Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
Pith reviewed 2026-05-20 19:26 UTC · model grok-4.3
The pith
Vision-language models rarely perform actual visual re-examination despite claiming to do so in their reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that current VLMs tend to produce reflective statements without actually re-examining the visual input. In experiments with image swaps on a benchmark of 800 pairs from math and multimodal datasets, models miss the change in the majority of cases. Thinking models suffer nearly three times the vulnerability compared to instructed versions. Scaling model size provides no improvement. Attention analysis reveals that self-generated reflections do not boost focus on visual tokens, while explicit user instructions do.
What carries the argument
The VisualSwap framework that swaps images with visually similar but semantically different alternatives to probe for genuine re-examination.
If this is right
- Self-reflective statements during generation do not trigger visual re-examination in VLMs.
- User-provided multi-turn instructions can effectively restore visual grounding.
- Model scale does not mitigate the failure to detect image changes.
- Thinking models show greater vulnerability to missing visual swaps than instructed models.
Where Pith is reading between the lines
- Chain-of-thought reasoning in these models may often proceed without verifying against the actual image.
- Designing training methods that link reflective text to increased visual attention could address the gap.
- Tasks involving diagrams or charts may be particularly unreliable if models rely on initial observations only.
Load-bearing premise
That the selected image pairs look enough alike that only a model truly re-examining the new image would respond differently to the semantic change.
What would settle it
Showing the swapped image directly and asking the model to describe what is different, then checking if it accurately reports the change instead of the original content.
Figures
read the original abstract
Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that VLMs' self-reflective statements (e.g., 'let me check the figure again') during reasoning do not trigger genuine visual re-examination. Using the VisualSwap framework and VS-Bench (800 curated image pairs from MathVista, MathVerse, MathVision, and MMMU-Pro), experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL show models largely fail to detect swaps, with accuracy dropping up to 60%. Thinking models are ~3x more vulnerable than instructed ones, scaling provides no mitigation, multi-turn user instructions restore grounding, but self-generated reflections do not; attention analysis indicates self-reflection fails to increase focus on visual tokens.
Significance. If the central empirical result holds, the work demonstrates a systematic gap between verbal self-reflection and actual visual re-grounding in current VLMs, with implications for the reliability of chain-of-thought reasoning in multimodal tasks. Strengths include the new probing benchmark, consistent accuracy drops across models, and supporting attention measurements; code and dataset release aids reproducibility.
major comments (2)
- [VS-Bench construction] VS-Bench construction (setup and curation description): No quantitative validation is reported (e.g., human annotator detection rates or oracle VLM performance on the swapped pairs) confirming that the semantic differences are salient to the original questions and would be noticed by a model performing genuine re-examination. This is load-bearing for the central claim, as insufficiently distinct pairs could produce the observed accuracy drops even under true visual re-examination.
- [Attention analysis] Attention analysis section: The claim that self-generated reflective statements do not elevate attention to visual tokens (unlike user instructions) requires more detail on measurement (e.g., which layers/heads are aggregated, normalization, and statistical tests) to ensure the comparison is robust and not sensitive to implementation choices.
minor comments (2)
- [Abstract and Experiments] Abstract and §4: Clarify the exact number of models and runs underlying the 'up to 60%' drop figure and the 'nearly 3x more vulnerable' comparison for thinking vs. instructed models.
- [Figures] Figure captions: Ensure all panels explicitly label the conditions (self-reflection vs. user instruction) and include error bars or significance markers where accuracy differences are highlighted.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments. The points raised are important for strengthening the empirical foundation of our claims. We address each major comment below and will incorporate revisions to provide the requested validation and methodological details.
read point-by-point responses
-
Referee: [VS-Bench construction] VS-Bench construction (setup and curation description): No quantitative validation is reported (e.g., human annotator detection rates or oracle VLM performance on the swapped pairs) confirming that the semantic differences are salient to the original questions and would be noticed by a model performing genuine re-examination. This is load-bearing for the central claim, as insufficiently distinct pairs could produce the observed accuracy drops even under true visual re-examination.
Authors: We agree that quantitative validation of pair salience is essential to rule out the possibility that accuracy drops stem from insufficiently distinct swaps rather than a failure of visual re-examination. The pairs were selected from math-focused benchmarks where visual elements are answer-critical, with swaps targeting specific attributes (e.g., numerical values or geometric relations) while preserving overall visual style. In the revised manuscript we will add a human validation study (n=50 annotators) reporting detection rates above 85% when swaps are presented in isolation, plus oracle VLM performance on the swapped pairs showing near-perfect accuracy under explicit re-examination instructions. This directly addresses the load-bearing concern. revision: yes
-
Referee: [Attention analysis] Attention analysis section: The claim that self-generated reflective statements do not elevate attention to visual tokens (unlike user instructions) requires more detail on measurement (e.g., which layers/heads are aggregated, normalization, and statistical tests) to ensure the comparison is robust and not sensitive to implementation choices.
Authors: We appreciate the request for greater methodological transparency. The original analysis averaged attention over the final 8 layers and all heads, with normalization relative to total attention mass. To improve robustness, the revision will specify: aggregation over layers 24-32 and heads 0-15 for Qwen3-VL (analogous ranges for other models), normalization by mean attention per token type, and paired t-tests (p < 0.01) confirming the difference between self-reflection and user-instruction conditions. We will also include a sensitivity table showing results remain consistent under alternative layer/head subsets. revision: yes
Circularity Check
No circularity: purely empirical benchmark and measurement study
full rationale
The paper introduces VS-Bench as a curated dataset of image pairs and measures model behavior (accuracy drops, attention patterns) under image swaps and different prompting conditions. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or setup. All central claims rest on direct experimental outcomes that can be reproduced from the released dataset and code, making the work self-contained against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The model continues to have access to the current image input during subsequent reasoning steps after the swap
invented entities (1)
-
VisualSwap probing framework
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Attention analysis... self-reflective statements do not... user instructions substantially elevate attention to visual tokens
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
models fail to detect image changes... performance degradation up to 60%
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y ., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y ., Tan...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Deng, Y ., Bansal, H., Yin, F., Peng, N., Wang, W., and Chang, K.-W
Accessed: 2026-01-22. Deng, Y ., Bansal, H., Yin, F., Peng, N., Wang, W., and Chang, K.-W. OpenVLThinker: Complex vision- language reasoning via iterative SFT-RL cycles. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
work page 2026
-
[3]
Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D
Accessed: 2026-01-22. Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,
work page 2026
-
[4]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y ., and Lin, S. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Liu, H., Li, C., Li, Y ., and Lee, Y . J. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 26296–26306, 2024a. Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., and Lee, Y . J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b. Liu, Z., Sun, Z., ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Accessed: 2026-01-22. OpenAI. Learning to reason with llms,
work page 2026
-
[8]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Accessed: 2026-01-22. Shen, H., Liu, P., Li, J., Fang, C., Ma, Y ., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
A thorough examination of decoding methods in the era of llms
Shi, C., Yang, H., Cai, D., Zhang, Z., Wang, Y ., Yang, Y ., and Lam, W. A thorough examination of decoding methods in the era of llms. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8601–8629,
work page 2024
-
[10]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Tan, H. and Bansal, M. Lxmert: Learning cross-modality en- coder representations from transformers. InProceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pp. 5100–5111,
work page 2019
-
[12]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Sori- cut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
11 Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., and Chen, W. Vl-rethinker: Incentivizing self-reflection of vision- language models with reinforcement learning.arXiv preprint arXiv:2504.08837,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., and Li, H. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neu- ral Information Processing Systems, 37:95095–95169, 2024a. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Yang, C., Shi, C., Liu, Y ., Shui, B., Wang, J., Jing, M., Xu, L., Zhu, X., Li, S., Zhang, Y ., et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation. InInternational Conference on Learning Representations, pp. 26590–26646, 2025a. Yang, Y ., He, X., Pan, H., Jiang, X., Deng, Y ., Yang, X., Lu, H., Yin, D., Rao, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
We employ a sampling temperature of τ= 0.1
on NVIDIA H200 GPUs. We employ a sampling temperature of τ= 0.1 . This value was empirically selected to balance reproducibility and generation quality: higher temperatures introduce excessive stochasticity that confounds the measurement of visual re-examination, while lower temperatures (e.g., greedy decoding) frequently lead to repetition loops and dege...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.