arxiv: 2605.30557 · v1 · pith:Y4WCG5ALnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI · cs.CL

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Pith reviewed 2026-06-29 07:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords: vision-language models, spatial reasoning, occlusion, perspective ambiguity, abstention, uncertainty, benchmark, visual evidence

The pith

VLMs overconfidently answer spatial questions even when views are occluded or misleading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialUncertain to test whether vision-language models recognize when spatial questions lack sufficient visual evidence and should not be answered. It creates two challenges: occlusion that hides needed information and perspective shifts that create misleading geometric cues. Questions are answerable from clean observations but require abstention under these conditions. Tests on multiple frontier models show they attempt answers anyway, reaching only around 30 percent accuracy with occlusion and below 10 percent with perspective ambiguity, while also struggling to select useful additional views. Readers should care because real-world spatial tasks depend on models knowing the limits of their observations rather than guessing from incomplete data.

Core claim

The paper establishes that frontier open- and closed-source VLMs exhibit consistent overconfident answering on spatial reasoning tasks under occlusion and perspective ambiguity challenges, producing average accuracies around 30 percent and below 10 percent respectively, while many models select additional resolving views at rates near random chance.

What carries the argument

SpatialUncertain, a controlled evaluation framework that pairs answerable spatial questions with introduced occlusion and perspective ambiguity conditions to require abstention or view selection.

If this is right

  • Spatial reasoning benchmarks must test for recognition of insufficient evidence in addition to answer correctness.
  • Models need explicit mechanisms to detect incomplete or misleading observations and choose to abstain.
  • Evaluation protocols should measure the ability to identify which additional viewpoints would resolve ambiguity.
  • Real-world deployment of VLMs for spatial tasks requires handling observation uncertainty rather than assuming reliable inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotics and navigation systems using these models may produce errors when relying on partial or angled camera feeds.
  • Training approaches could incorporate explicit rewards for abstention on ambiguous spatial inputs.
  • The results point toward integrating active perception, where models request new observations when current ones are insufficient.

Load-bearing premise

The constructed spatial questions become genuinely unanswerable from the challenged views, so that abstention is the required behavior.

What would settle it

A model that abstains at high rates on the occlusion and ambiguity cases or selects resolving views well above chance while maintaining high accuracy on clean cases would contradict the reported failure modes.
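
The test described above can be made concrete with a small scoring sketch. This is a minimal illustration, not the paper's evaluation code: the `Item` record, the `ABSTAIN` label, the condition names, and the function names are all hypothetical, and it assumes model responses have already been normalized to option labels.

```python
from dataclasses import dataclass
from typing import List

ABSTAIN = "abstain"  # assumed label for a "cannot be determined" response


@dataclass
class Item:
    condition: str   # "clean", "occlusion", or "perspective" (assumed names)
    gold: str        # expected response; ABSTAIN on challenged views
    prediction: str  # model response, already normalized to a label


def _subset(items: List[Item], condition: str) -> List[Item]:
    """Return the items belonging to one condition, or fail if there are none."""
    subset = [it for it in items if it.condition == condition]
    if not subset:
        raise ValueError(f"no items for condition {condition!r}")
    return subset


def accuracy(items: List[Item], condition: str) -> float:
    """Fraction of items in a condition whose prediction matches the gold label."""
    subset = _subset(items, condition)
    return sum(it.prediction == it.gold for it in subset) / len(subset)


def abstention_rate(items: List[Item], condition: str) -> float:
    """Fraction of items in a condition on which the model abstained."""
    subset = _subset(items, condition)
    return sum(it.prediction == ABSTAIN for it in subset) / len(subset)


def view_selection_margin(chosen: List[int], gold: List[int],
                          num_candidates: int) -> float:
    """View-selection accuracy minus the uniform-chance rate 1/num_candidates."""
    if len(chosen) != len(gold) or not gold:
        raise ValueError("chosen and gold must be non-empty and equal length")
    acc = sum(c == g for c, g in zip(chosen, gold)) / len(gold)
    return acc - 1.0 / num_candidates
```

Under these assumptions, a model that contradicts the reported failure modes would show high `accuracy` on clean items, a high `abstention_rate` on occlusion and perspective items, and a `view_selection_margin` well above zero.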

Figures

Figures reproduced from arXiv: 2605.30557 by Han Lin, Idan Szpektor, Mohit Bansal, Yonatan Bitton, Yue Zhang, Zun Wang.

Figure 1: Visual observations are inherently 2D projections of a 3D world and may provide sufficient, …
Figure 2: Overview of our evaluation framework of SPATIALUNCERTAIN. (Top) Occlusion: A target object is occluded to create partial or full occlusion configurations, each paired with a clean reference. (Bottom) Perspective: Same-category object pairs are viewed from a reference (equidistant) and an ambiguous (shifted) camera position. We further introduce ViewSel (single-stage view selection) and AbstainViewSel (two-…
Figure 3: Camera placement under different conditions (left) and the resulting distribution of …
Figure 4: Examples of our controlled evaluation scenes.
Figure 5: Model accuracy across question types under occlusion (top) and perspective ambiguity …
Figure 6: Annotation Interface for Occlusion and Perspective Scenes.
Original abstract

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SpatialUncertain, a controlled evaluation framework for VLMs on spatial reasoning questions under two observation challenges: occlusion (hiding target information) and perspective ambiguity (misleading geometric cues). It claims that questions are answerable from clean views but require abstention under challenges, and reports that frontier VLMs exhibit overconfident answering (average accuracy ~30% under occlusion, <10% under perspective ambiguity) and often fail to identify helpful additional viewpoints even when available.

Significance. If the central claims hold after validation, the work is significant for highlighting a gap between answer correctness and uncertainty awareness in VLMs, which is critical for real-world deployment. The empirical evaluation across open- and closed-source models provides concrete failure mode data that could motivate new benchmarks focused on abstention and evidence-seeking behavior.

major comments (2)
  1. [Abstract / SpatialUncertain description] Abstract and SpatialUncertain construction: the claim that designed questions 'are answerable under clean observations but require abstention under the introduced challenges' is load-bearing for interpreting low accuracies as overconfident failure to abstain rather than task hardness, yet the manuscript provides no human performance baselines on the challenged images, no formal geometric argument for unanswerability, and no independent verification that correct answers are impossible from the given views.
  2. [Abstract] Abstract: the specific accuracy figures (~30% occlusion, below 10% perspective) are presented without any description of question construction details, prompting strategy, number of trials, or statistical controls, preventing verification of the 'systematic failure modes' claim.
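
One way to supply the statistical controls the second comment asks for is to report an interval around each accuracy and to check whether a view-selection rate is distinguishable from chance. The sketch below uses the standard Wilson score interval; the function names and the 1.96 critical value (roughly 95% coverage) are choices made here for illustration, and nothing in it comes from the paper's own protocol.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return center - half, center + half


def consistent_with_chance(successes: int, trials: int, chance: float,
                           z: float = 1.96) -> bool:
    """True if the chance rate lies inside the Wilson interval."""
    low, high = wilson_interval(successes, trials, z)
    return low <= chance <= high
```

For example, with a hypothetical 100 questions and four candidate views, `consistent_with_chance(26, 100, 0.25)` would indicate a selection rate that cannot be separated from guessing, which is the kind of check a "near random chance" claim needs.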

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas for strengthening the validation of our claims. We address each major comment below and will incorporate revisions to provide additional evidence and details.

Point-by-point responses
  1. Referee: [Abstract / SpatialUncertain description] Abstract and SpatialUncertain construction: the claim that designed questions 'are answerable under clean observations but require abstention under the introduced challenges' is load-bearing for interpreting low accuracies as overconfident failure to abstain rather than task hardness, yet the manuscript provides no human performance baselines on the challenged images, no formal geometric argument for unanswerability, and no independent verification that correct answers are impossible from the given views.

    Authors: We agree these elements would strengthen the central claim. The manuscript describes the question design process in Section 3, where configurations were selected such that target information is inaccessible or misleading under the challenges while solvable from clean views. However, we acknowledge the absence of human baselines, formal geometric arguments, and documented independent verification. In revision, we will add: (1) human performance results on a subset of clean and challenged views demonstrating appropriate abstention; (2) a geometric analysis subsection with diagrams showing why answers cannot be determined; and (3) details on multi-annotator verification of unanswerability during construction. revision: yes

  2. Referee: [Abstract] Abstract: the specific accuracy figures (~30% occlusion, below 10% perspective) are presented without any description of question construction details, prompting strategy, number of trials, or statistical controls, preventing verification of the 'systematic failure modes' claim.

    Authors: The abstract prioritizes brevity, while the full manuscript provides these details in Sections 3 (construction of occlusion and perspective ambiguity cases) and 4 (evaluation protocol, including prompting variants and trial counts with variance). To improve verifiability, we will revise the abstract to include a brief reference to the evaluation scale and direct readers to the methods for construction, prompting, and statistical details. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation on new benchmark; no derivations or fitted predictions

full rationale

The paper introduces SpatialUncertain by constructing questions asserted to be answerable under clean observations but requiring abstention under occlusion/perspective challenges. It then reports empirical accuracies of existing VLMs on this test set (e.g., ~30% under occlusion). No equations, parameters, or predictions are fitted within the paper and then renamed as results. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are present. The evaluation is externally falsifiable via human baselines or further testing on the released set. This is a standard empirical benchmark paper with no internal circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the constructed challenges render the chosen questions unanswerable, plus standard assumptions in VLM benchmarking about prompt consistency and answer extraction.

axioms (1)
  • domain assumption: Visual observations can be insufficient or misleading for spatial questions due to occlusion and perspective.
    Foundational premise for defining when abstention is required in SpatialUncertain.

pith-pipeline@v0.9.1-grok · 5818 in / 1198 out tokens · 36178 ms · 2026-06-29T07:46:12.882589+00:00 · methodology

